1
00:00:00,012 --> 00:00:04,212
[MUSIC]

2
00:00:04,212 --> 00:00:08,629
So let's see what happens if
we remove this observation.

3
00:00:08,629 --> 00:00:12,169
And this observation here,
this is the observation for Center City,

4
00:00:12,169 --> 00:00:13,375
that downtown region.

5
00:00:13,375 --> 00:00:17,253
So, not surprisingly,
that's where a lot of crimes happen, but

6
00:00:17,253 --> 00:00:21,630
it's also where there's a mixture of
low value and very high value homes.

7
00:00:21,630 --> 00:00:26,980
So on average, the value is
higher than one might expect for

8
00:00:26,980 --> 00:00:28,940
the amount of crime that
occurs in that region.

9
00:00:30,450 --> 00:00:34,895
Okay, so what we're gonna do now,

10
00:00:34,895 --> 00:00:40,230
is just get down to this line here,

11
00:00:40,230 --> 00:00:44,030
we're gonna simply remove
Center City from our dataset.

12
00:00:44,030 --> 00:00:50,000
Okay, and I know that Center City
is the town that is zero miles.

13
00:00:50,000 --> 00:00:52,260
If we go back to this column
that we discussed before,

14
00:00:52,260 --> 00:00:55,350
it's zero miles to Center City,
because it is Center City.

15
00:00:56,420 --> 00:01:00,980
So we're just removing
that row of our data and

16
00:01:00,980 --> 00:01:02,550
then we're gonna redo our scatter plot.
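
A minimal sketch of this row removal, assuming the data is held as plain Python tuples. The towns, numbers, and the name `towns_no_cc` are invented for illustration and are not the course data.

```python
# Hypothetical rows, invented for illustration (not the course data):
# (town name, crime rate, house price, miles to Center City)
towns = [
    ("Town A", 10.0, 300000.0, 10.0),
    ("Town B", 20.0, 270000.0, 8.0),
    ("Town C", 30.0, 240000.0, 12.0),
    ("Town D", 40.0, 210000.0, 5.0),
    ("Center City", 300.0, 150000.0, 0.0),
]

# Center City is the only row that is zero miles from Center City,
# so filtering on that column drops exactly that observation.
towns_no_cc = [row for row in towns if row[3] != 0.0]

for name, crime, price, miles in towns_no_cc:
    print(name, crime, price, miles)
```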

17
00:01:04,520 --> 00:01:11,060
If we scroll down we see that what
we have is our cloud of points, but

18
00:01:11,060 --> 00:01:17,270
for a much smaller range of crime now that
outlying Center City has been removed.

19
00:01:18,580 --> 00:01:24,100
Okay, so now what we're gonna
do is we're gonna go and

20
00:01:24,100 --> 00:01:27,760
refit our simple regression model.

21
00:01:27,760 --> 00:01:30,350
But on this data set where
Center City has been removed.

22
00:01:30,350 --> 00:01:37,712
So I'm calling this Crime Model_NoCC
meaning no Center City observation.

23
00:01:37,712 --> 00:01:43,630
Now let's look at the fit
associated with this new model.

24
00:01:43,630 --> 00:01:47,880
Well actually it's the same model,
but just on a revised dataset.

25
00:01:47,880 --> 00:01:50,970
And what we see again
is this downward trend

26
00:01:50,970 --> 00:01:54,660
of house value with
increasing crime rate.

27
00:01:54,660 --> 00:01:58,940
But we see a much better fit to
the observations that are remaining

28
00:01:58,940 --> 00:02:00,310
in our dataset.

29
00:02:00,310 --> 00:02:05,430
But to make this a little more explicit,
let's actually compare the coefficients,

30
00:02:05,430 --> 00:02:09,520
between the fit that we had when
Center City was in our dataset, and

31
00:02:09,520 --> 00:02:11,970
the fit that we got when
we removed Center City.
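
The comparison can be sketched with a closed-form least squares fit in plain Python. The data below is made up, so the slopes will not match the $576 and $2,287 figures from the notebook; `fit_simple_regression` is a hypothetical helper, not the function used in the lecture.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Invented data: the last point plays the role of Center City,
# with a crime rate far beyond that of the other towns.
crime = [10.0, 20.0, 30.0, 40.0, 300.0]
price = [300000.0, 270000.0, 240000.0, 210000.0, 150000.0]

intercept_all, slope_all = fit_simple_regression(crime, price)
intercept_nocc, slope_nocc = fit_simple_regression(crime[:-1], price[:-1])

print("with outlying point:    intercept, slope =", intercept_all, slope_all)
print("without outlying point: intercept, slope =", intercept_nocc, slope_nocc)
```

In this toy data, dropping the single outlying point makes the fitted slope much steeper, which is the same qualitative effect seen in the demo.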

32
00:02:14,850 --> 00:02:19,420
Okay, so here are the coefficients,
our intercept, and

33
00:02:19,420 --> 00:02:22,760
our slope when we had Center City.

34
00:02:22,760 --> 00:02:27,500
And here are the coefficients
when we remove Center City.

35
00:02:27,500 --> 00:02:30,090
So let's talk about this slope term.

36
00:02:31,220 --> 00:02:36,220
When Center City was in our dataset,
we said that the average house value

37
00:02:36,220 --> 00:02:43,600
decreased by an amount of $576
per unit increase in crime rate.

38
00:02:43,600 --> 00:02:45,630
Remember we know how to
interpret these coefficients and

39
00:02:45,630 --> 00:02:47,720
that's what I'm doing right now.

40
00:02:47,720 --> 00:02:50,596
In contrast, when I remove Center City,

41
00:02:50,596 --> 00:02:56,102
what is the predicted decrease in crime
rate I get, I mean, sorry, the predicted

42
00:02:56,102 --> 00:03:01,463
decrease in house value that I'm
getting per unit increase in crime rate?

43
00:03:01,463 --> 00:03:05,504
Now, just removing one observation,

44
00:03:05,504 --> 00:03:09,796
my predicted decrease is $2,287.

45
00:03:09,796 --> 00:03:11,921
That's significantly different.

46
00:03:11,921 --> 00:03:17,768
So when I'm going and
I'm making an interpretation about

47
00:03:17,768 --> 00:03:22,790
how much crime rate affects
drops in house value,

48
00:03:22,790 --> 00:03:27,570
I have significantly different
interpretations when I include Center City

49
00:03:27,570 --> 00:03:29,280
in the dataset versus removing it.

50
00:03:30,290 --> 00:03:33,730
So now let's just discuss a little
bit about why this is, and

51
00:03:33,730 --> 00:03:35,470
this brings us to two points.

52
00:03:35,470 --> 00:03:38,110
I've put a little paragraph of text.

53
00:03:38,110 --> 00:03:39,740
We're gonna share these
notebooks with you.

54
00:03:39,740 --> 00:03:43,590
You can go through, rerun everything
we're doing, do different analyses, and

55
00:03:43,590 --> 00:03:45,820
also read the comments that I've put here.

56
00:03:45,820 --> 00:03:49,630
But let's discuss this idea of what
are called High Leverage Points and

57
00:03:49,630 --> 00:03:50,950
Influential Observations.

58
00:03:52,090 --> 00:03:57,868
So, a high leverage point is
a point that along the x-axis,

59
00:03:57,868 --> 00:04:01,350
along our input axis is an outlier.

60
00:04:01,350 --> 00:04:05,790
So, it's a very extreme value,
either extremely large or

61
00:04:05,790 --> 00:04:09,620
extremely small, relative to
where we have other observations.
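
The lecture defines leverage informally. One standard way to quantify it for simple regression is the leverage value h_i = 1/n + (x_i - x_mean)^2 / sum_j (x_j - x_mean)^2, which depends only on the inputs. A minimal sketch with invented crime rates (the helper name `leverage_values` is hypothetical):

```python
def leverage_values(xs):
    """Leverage of each input in a simple regression with an intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return [1.0 / n + (x - x_mean) ** 2 / sxx for x in xs]


# Invented crime rates: the last one is far from all the others.
crime = [10.0, 20.0, 30.0, 40.0, 300.0]
leverages = leverage_values(crime)

for x, h in zip(crime, leverages):
    print(x, round(h, 3))
```

The values always sum to 2 here (one per fitted coefficient), so a single point whose leverage is close to 1 is holding almost as much sway over the line as all the other points combined.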

62
00:04:09,620 --> 00:04:14,860
So, if we go back up to the plot
that has Center City in it,

63
00:04:14,860 --> 00:04:21,020
we see clearly that the crime rate
associated with Center City is very,

64
00:04:21,020 --> 00:04:25,720
very different than the crime
rates we see for other towns.

65
00:04:25,720 --> 00:04:29,760
So what that means is that point is a high
leverage point, because, if we go back and

66
00:04:29,760 --> 00:04:32,610
think about our closed form solution for

67
00:04:32,610 --> 00:04:38,130
our simple regression model, for
estimating the coefficients of this model,

68
00:04:38,130 --> 00:04:41,300
well, if you go and look at those equations
you'll see that there's a term that

69
00:04:41,300 --> 00:04:47,610
relates to the center of mass of our
X values, so the average X value.
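
For reference, the closed-form least squares solution for simple regression can be written as below. The symbols are assumptions about notation: w_0 for the intercept, w_1 for the slope, and bars for the averages over the N observations.

```latex
\hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x},
\qquad
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i,
\quad
\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i
```

A point with a very extreme x_i pulls the average \bar{x} toward itself, and its own deviation (x_i - \bar{x}) is large, so its term can dominate both the numerator and the denominator of the slope.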

70
00:04:47,610 --> 00:04:51,550
And so including a point
that's very far out is gonna

71
00:04:51,550 --> 00:04:55,680
strongly influence where the center
of mass of this line is.

72
00:04:55,680 --> 00:04:58,890
So that's gonna dramatically
change the fit as well as

73
00:05:00,370 --> 00:05:03,720
another term that depends on
the value of this observation,

74
00:05:03,720 --> 00:05:07,770
which is gonna have this line trying
to get close to this observation.

75
00:05:07,770 --> 00:05:10,550
Remember, we're trying to minimize
residual sum of squares.

76
00:05:10,550 --> 00:05:14,890
So if it ignored it and just drew
a line very steeply going down,

77
00:05:14,890 --> 00:05:18,587
we'd have a very massive residual
sum of squares for this point here.

78
00:05:18,587 --> 00:05:21,500
So it's gonna try and hit this point.

79
00:05:22,900 --> 00:05:26,921
And thus, the influence of
that point can be very large.

80
00:05:26,921 --> 00:05:30,464
Okay, so this gets us to a point
of influential observations.

81
00:05:30,464 --> 00:05:34,471
Now, let's just return to
this little text I have here,

82
00:05:34,471 --> 00:05:38,729
where just because an observation
is a high leverage point,

83
00:05:38,729 --> 00:05:43,336
meaning that it's outlying,
either very small or very large X.

84
00:05:43,336 --> 00:05:47,299
It doesn't mean that it's going
to strongly influence the fit,

85
00:05:47,299 --> 00:05:52,140
because if that observation follows
the trend of the other data,

86
00:05:52,140 --> 00:05:54,490
then it might not influence
things very much at all.

87
00:05:54,490 --> 00:05:58,750
Removing that observation you might get
a very similar fit, had Center City

88
00:05:58,750 --> 00:06:01,580
followed a similar kind of trend to what
we saw for the other observations.

89
00:06:02,600 --> 00:06:06,690
However, it has the potential
to strongly influence the fit

90
00:06:06,690 --> 00:06:07,910
as we've seen in this demo.

91
00:06:09,130 --> 00:06:13,240
So an influential observation
is an observation where if you

92
00:06:13,240 --> 00:06:16,400
remove it from the dataset
you get a very different fit.
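
A small sketch of the difference between leverage and influence, using invented data. Both datasets share the same outlying input, but only one of them has a response that departs from the trend of the other points. The helper names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least squares slope of the line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def slope_change_without_last(xs, ys):
    """Absolute change in slope when the last observation is removed."""
    return abs(fit_slope(xs[:-1], ys[:-1]) - fit_slope(xs, ys))


# Invented inputs: the last crime rate is far outside the typical range.
crime = [10.0, 20.0, 30.0, 40.0, 300.0]

# Case 1: the outlying point follows the same line as the other points.
on_trend = [400000.0 - 1000.0 * x for x in crime]

# Case 2: same inputs, but the outlying point sits well above that line.
off_trend = on_trend[:-1] + [350000.0]

change_on = slope_change_without_last(crime, on_trend)
change_off = slope_change_without_last(crime, off_trend)

print("slope change, on-trend point removed: ", change_on)
print("slope change, off-trend point removed:", change_off)
```

Removing the on-trend point leaves the slope essentially unchanged, so it is high leverage but not influential, while removing the off-trend point changes the slope substantially.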

93
00:06:17,720 --> 00:06:21,713
But I also wanna emphasize that points
that are not high leverage points,

94
00:06:21,713 --> 00:06:26,558
so points that are actually
within our typical X range

95
00:06:26,558 --> 00:06:29,796
can be influential observations.

96
00:06:29,796 --> 00:06:34,361
So in particular you can think of an
observation that's very outlying in the Y

97
00:06:34,361 --> 00:06:36,058
direction in our response.

98
00:06:36,058 --> 00:06:41,575
So for example, a town that has an
extremely high value relative to what you

99
00:06:41,575 --> 00:06:48,510
might see from other observations, well
that can also strongly influence the fit.

100
00:06:48,510 --> 00:06:50,069
But the potential for doing so

101
00:06:50,069 --> 00:06:54,119
is much less when it's in the typical X
range if you have dense observations,

102
00:06:54,119 --> 00:06:57,755
because the fit will be controlled
by all these other observations.

103
00:06:57,755 --> 00:06:59,770
Whereas if it's an outlying point,

104
00:06:59,770 --> 00:07:03,547
you can just think of the control
it has as being much, much greater.

105
00:07:03,547 --> 00:07:07,829
[MUSIC]