1 00:00:00,012 --> 00:00:04,212 [MUSIC] 2 00:00:04,212 --> 00:00:08,629 So let's see what happens if we remove this observation. 3 00:00:08,629 --> 00:00:12,169 And this observation here, this is the observation for Center City, 4 00:00:12,169 --> 00:00:13,375 that downtown region. 5 00:00:13,375 --> 00:00:17,253 So, not surprisingly, that's where a lot of crimes happen, but 6 00:00:17,253 --> 00:00:21,630 it's also where there's a mixture of low value and very high value homes. 7 00:00:21,630 --> 00:00:26,980 So on average, the value is higher than one might expect for 8 00:00:26,980 --> 00:00:28,940 the amount of crime that occurs in that region. 9 00:00:30,450 --> 00:00:34,895 Okay, so what we're gonna do now, 10 00:00:34,895 --> 00:00:40,230 is just get down to this line here, 11 00:00:40,230 --> 00:00:44,030 we're gonna simply remove Center City from our data set. 12 00:00:44,030 --> 00:00:50,000 Okay, and I know that Center City is the town that is zero miles. 13 00:00:50,000 --> 00:00:52,260 If we go back to this column that we discussed before, 14 00:00:52,260 --> 00:00:55,350 it's zero miles to Center City, because it is Center City. 15 00:00:56,420 --> 00:01:00,980 So we're just removing that row of our data and 16 00:01:00,980 --> 00:01:02,550 then we're gonna redo our scatter plot. 17 00:01:04,520 --> 00:01:11,060 If we scroll down we see that what we have is our cloud of points, but 18 00:01:11,060 --> 00:01:17,270 for a much smaller range of crime now that outlying Center City has been removed. 19 00:01:18,580 --> 00:01:24,100 Okay, so now what we're gonna do is we're gonna go and 20 00:01:24,100 --> 00:01:27,760 refit our simple regression model, 21 00:01:27,760 --> 00:01:30,350 but on this data set where Center City has been removed. 22 00:01:30,350 --> 00:01:37,712 So I'm calling this Crime Model_NoCC meaning no Center City observation. 23 00:01:37,712 --> 00:01:43,630 Now let's look at the fit associated with this new model.
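[Editor's note] The step described above (drop the row that is zero miles from Center City, then refit the simple regression) can be sketched roughly as follows. This is only an illustration under stated assumptions: the arrays `crime`, `value`, and `miles`, and every number in them, are made-up toy data rather than the lecture's data set, and `fit_simple_regression` is a hypothetical helper, not the notebook's own function.

```python
import numpy as np

# Hypothetical toy data (NOT the lecture's data set): one row per town.
crime = np.array([30.0, 25.0, 20.0, 15.0, 10.0, 370.0])   # crime rate
value = np.array([150_000.0, 165_000.0, 180_000.0,
                  190_000.0, 205_000.0, 100_000.0])        # average house value
miles = np.array([12.0, 18.0, 9.0, 22.0, 15.0, 0.0])       # miles to Center City


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Center City is the only row that is zero miles from Center City, so drop it.
keep = miles != 0.0
crime_no_cc, value_no_cc = crime[keep], value[keep]

# Fit once on all rows and once on the reduced data set.
intercept_all, slope_all = fit_simple_regression(crime, value)
intercept_no_cc, slope_no_cc = fit_simple_regression(crime_no_cc, value_no_cc)

print("slope with Center City:   ", slope_all)
print("slope without Center City:", slope_no_cc)
```

In this toy example, as in the lecture, the fitted slope becomes much steeper once the single outlying row is removed.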
24 00:01:43,630 --> 00:01:47,880 Well actually it's the same model, but just on a revised dataset. 25 00:01:47,880 --> 00:01:50,970 And what we see again is this downward trend 26 00:01:50,970 --> 00:01:54,660 of house value with increasing crime rate. 27 00:01:54,660 --> 00:01:58,940 But we see a much better fit to the observations that are remaining 28 00:01:58,940 --> 00:02:00,310 in our dataset. 29 00:02:00,310 --> 00:02:05,430 But to make this a little more explicit, let's actually compare the coefficients, 30 00:02:05,430 --> 00:02:09,520 between the fit that we had when Center City was in our dataset, and 31 00:02:09,520 --> 00:02:11,970 the fit that we got when we removed Center City. 32 00:02:14,850 --> 00:02:19,420 Okay, so here are the coefficients, our intercept, and 33 00:02:19,420 --> 00:02:22,760 our slope when we had Center City. 34 00:02:22,760 --> 00:02:27,500 And here are the coefficients when we remove Center City. 35 00:02:27,500 --> 00:02:30,090 So let's talk about this slope term. 36 00:02:31,220 --> 00:02:36,220 When Center City was in our dataset, we said that the average house value 37 00:02:36,220 --> 00:02:43,600 decreased by an amount of $576 per unit increase in crime rate. 38 00:02:43,600 --> 00:02:45,630 Remember we know how to interpret these coefficients and 39 00:02:45,630 --> 00:02:47,720 that's what I'm doing right now. 40 00:02:47,720 --> 00:02:50,596 In contrast, when I remove Center City, 41 00:02:50,596 --> 00:02:56,102 what is the predicted decrease in crime rate, I mean sorry, predicted 42 00:02:56,102 --> 00:03:01,463 decrease in house value that I'm getting per unit increase in crime rate? 43 00:03:01,463 --> 00:03:05,504 Now, just removing one observation, 44 00:03:05,504 --> 00:03:09,796 my predicted decrease is $2,287. 45 00:03:09,796 --> 00:03:11,921 That's significantly different.
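[Editor's note] To make the slope interpretation above concrete, here is a small arithmetic sketch. The two slopes (-576 and -2,287 dollars per unit of crime rate) are the values quoted in the lecture; the 10-unit increase in crime rate and the helper `predicted_change` are hypothetical, chosen only for illustration.

```python
# Slopes quoted in the lecture, in dollars per unit increase in crime rate.
slope_with_cc = -576.0    # Center City included
slope_no_cc = -2287.0     # Center City removed


def predicted_change(slope, delta_crime):
    """Predicted change in average house value for a given change in crime rate."""
    return slope * delta_crime


delta = 10.0  # hypothetical increase in crime rate, for illustration only

change_with = predicted_change(slope_with_cc, delta)
change_without = predicted_change(slope_no_cc, delta)
ratio = slope_no_cc / slope_with_cc

print("predicted change with Center City:   ", change_with)
print("predicted change without Center City:", change_without)
print("ratio of the two slopes:             ", round(ratio, 2))
```

The same increase in crime rate is associated with roughly four times the predicted drop in house value once Center City is removed, which is why the two fits lead to such different interpretations.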
46 00:03:11,921 --> 00:03:17,768 So when I'm going and I'm making an interpretation about 47 00:03:17,768 --> 00:03:22,790 how much crime rate affects drops in house value, 48 00:03:22,790 --> 00:03:27,570 I have significantly different interpretations when I include Center City 49 00:03:27,570 --> 00:03:29,280 in the dataset versus removing it. 50 00:03:30,290 --> 00:03:33,730 So now let's just discuss a little bit about why this is, and 51 00:03:33,730 --> 00:03:35,470 this brings us to two points. 52 00:03:35,470 --> 00:03:38,110 I've put a little paragraph of text. 53 00:03:38,110 --> 00:03:39,740 We're gonna share these notebooks with you. 54 00:03:39,740 --> 00:03:43,590 You can go through, rerun everything we're doing, do different analyses, and 55 00:03:43,590 --> 00:03:45,820 also read the comments that I've put here. 56 00:03:45,820 --> 00:03:49,630 But let's discuss this idea of what are called High Leverage Points and 57 00:03:49,630 --> 00:03:50,950 Influential Observations. 58 00:03:52,090 --> 00:03:57,868 So, a high leverage point is a point that along the x-axis, 59 00:03:57,868 --> 00:04:01,350 along our input axis is an outlier. 60 00:04:01,350 --> 00:04:05,790 So, it's a very extreme value, either extremely large or 61 00:04:05,790 --> 00:04:09,620 extremely small, relative to where we have other observations. 62 00:04:09,620 --> 00:04:14,860 So, if we go back up to the plot that has Center City in it, 63 00:04:14,860 --> 00:04:21,020 we see clearly that the crime rate associated with Center City is very, 64 00:04:21,020 --> 00:04:25,720 very different than the crime rates we see for other towns. 65 00:04:25,720 --> 00:04:29,760 So what that means is that point is a high leverage point, because, if we go back and 66 00:04:29,760 --> 00:04:32,610 think about our closed form solution for 67 00:04:32,610 --> 00:04:38,130 our simple regression model, for estimating the coefficients of this model.
68 00:04:38,130 --> 00:04:41,300 Well if you go and look at those equations you'll see that there's a term that 69 00:04:41,300 --> 00:04:47,610 relates to the center of mass of our X values, so the average X value. 70 00:04:47,610 --> 00:04:51,550 And so including a point that's very far out is gonna 71 00:04:51,550 --> 00:04:55,680 strongly influence where the center of mass of this line is. 72 00:04:55,680 --> 00:04:58,890 So that's gonna dramatically change the fit as well as 73 00:05:00,370 --> 00:05:03,720 another term that depends on the value of this observation, 74 00:05:03,720 --> 00:05:07,770 which is gonna have this line trying to get close to this observation. 75 00:05:07,770 --> 00:05:10,550 Remember, we're trying to minimize the residual sum of squares. 76 00:05:10,550 --> 00:05:14,890 So if it ignored it and it just drew a line very steeply going down, 77 00:05:14,890 --> 00:05:18,587 we'd have a very massive residual sum of squares for this point here. 78 00:05:18,587 --> 00:05:21,500 So it's gonna try and hit this point. 79 00:05:22,900 --> 00:05:26,921 And thus, the influence of that point can be very large. 80 00:05:26,921 --> 00:05:30,464 Okay, so this gets us to a point of influential observations. 81 00:05:30,464 --> 00:05:34,471 Now, let's just return to this little text I have here, 82 00:05:34,471 --> 00:05:38,729 where just because an observation is a high leverage point, 83 00:05:38,729 --> 00:05:43,336 meaning that it's an outlier, either very small or very large X, 84 00:05:43,336 --> 00:05:47,299 it doesn't mean that it's going to strongly influence the fit, 85 00:05:47,299 --> 00:05:52,140 because if that observation follows the trend of the other data, 86 00:05:52,140 --> 00:05:54,490 then it might not influence things very much at all.
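[Editor's note] The lecture describes a high leverage point qualitatively, as an outlier along the input axis. One standard way to put a number on this for simple regression, which the lecture itself does not state, is the leverage h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The sketch below computes it for hypothetical toy crime rates; the function name `leverage` and the data are assumptions for illustration only.

```python
import numpy as np


def leverage(x):
    """Leverage of each input under simple linear regression:
    h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2.
    """
    x = np.asarray(x, dtype=float)
    dev_sq = (x - x.mean()) ** 2          # squared distance from the center of mass
    return 1.0 / x.size + dev_sq / dev_sq.sum()


# Hypothetical toy crime rates; the last entry plays the role of Center City.
crime = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 370.0])
h = leverage(crime)

for rate, h_i in zip(crime, h):
    print(f"crime rate {rate:6.1f}  leverage {h_i:.3f}")
```

The leverages always sum to 2 here (one per fitted coefficient), so a single value near 1 means that one observation has almost as much pull on the line as all the others combined. Leverage depends only on the inputs, which matches the point made in the lecture: it measures the potential to influence the fit, not whether the fit actually changes.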
87 00:05:54,490 --> 00:05:58,750 Removing that observation you might get a very similar fit, had Center City 88 00:05:58,750 --> 00:06:01,580 had a similar kind of trend to what we saw for the other observations. 89 00:06:02,600 --> 00:06:06,690 However, it has the potential to strongly influence the fit 90 00:06:06,690 --> 00:06:07,910 as we've seen in this demo. 91 00:06:09,130 --> 00:06:13,240 So an influential observation is an observation where if you 92 00:06:13,240 --> 00:06:16,400 remove it from the dataset you get a very different fit. 93 00:06:17,720 --> 00:06:21,713 But I also wanna emphasize that points that are not high leverage points, 94 00:06:21,713 --> 00:06:26,558 so points that are actually within our typical X range, 95 00:06:26,558 --> 00:06:29,796 can be influential observations. 96 00:06:29,796 --> 00:06:34,361 So in particular you can think of an observation that's very outlying in the Y 97 00:06:34,361 --> 00:06:36,058 direction in our response. 98 00:06:36,058 --> 00:06:41,575 So for example, a town that has an extremely high value relative to what you 99 00:06:41,575 --> 00:06:48,510 might see from other observations, well that can also strongly influence the fit. 100 00:06:48,510 --> 00:06:50,069 But the potential for doing so 101 00:06:50,069 --> 00:06:54,119 is much less when it's in the typical X range if you have dense observations, 102 00:06:54,119 --> 00:06:57,755 because the fit will be controlled by all these other observations. 103 00:06:57,755 --> 00:06:59,770 Whereas if it's an outlying point, 104 00:06:59,770 --> 00:07:03,547 you can just think of the control it has as being much, much greater. 105 00:07:03,547 --> 00:07:07,829 [MUSIC]
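[Editor's note] The distinction drawn above between high leverage and influence can be checked by removing one observation and seeing how much the slope moves. The sketch below is an illustration on synthetic data only: the trend line, the added points, and the helpers `slope` and `slope_change_when_removed` are all hypothetical. It compares three added points: (A) extreme in X but on the trend, (B) extreme in X and off the trend, and (C) inside the dense X range but off the trend by the same amount as (B).

```python
import numpy as np


def slope(x, y):
    """Least squares slope for simple regression."""
    dx = x - x.mean()
    return np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)


def slope_change_when_removed(x, y, i):
    """Slope fitted without observation i, minus slope fitted on all observations."""
    keep = np.arange(x.size) != i
    return slope(x[keep], y[keep]) - slope(x, y)


# Synthetic "dense" observations lying exactly on a downward trend line.
x_base = np.linspace(10.0, 30.0, 101)
y_base = 200_000.0 - 1_000.0 * x_base


def change_for_added_point(x_new, y_new):
    """Append one point to the dense data and measure its influence on the slope."""
    x = np.append(x_base, x_new)
    y = np.append(y_base, y_new)
    return slope_change_when_removed(x, y, x.size - 1)


change_a = change_for_added_point(100.0, 100_000.0)  # extreme X, on the trend
change_b = change_for_added_point(100.0, 180_000.0)  # extreme X, 80,000 above the trend
change_c = change_for_added_point(25.0, 255_000.0)   # typical X, 80,000 above the trend

print("A (high leverage, on trend): ", change_a)
print("B (high leverage, off trend):", change_b)
print("C (typical X, off trend):    ", change_c)
```

In this synthetic setup, point A barely changes the slope, point B changes it a lot, and point C changes it by a noticeably smaller amount than B, which is consistent with the lecture's remark that the dense observations around a typical X value keep the fit under control.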