1 00:00:00,000 --> 00:00:04,293 [MUSIC] 2 00:00:04,293 --> 00:00:08,213 Now we're gonna discuss an important issue: the influence of what are called high 3 00:00:08,213 --> 00:00:09,400 leverage points. 4 00:00:09,400 --> 00:00:13,540 And these are points that can be considered influential observations. 5 00:00:13,540 --> 00:00:16,900 But to have this discussion, I think it's really useful to just look at some data. 6 00:00:18,760 --> 00:00:23,700 So to start with, let's fire up GraphLab and then let's load some data. 7 00:00:23,700 --> 00:00:27,920 And for this, we're gonna load our data into an SFrame. 8 00:00:27,920 --> 00:00:33,100 And I'm gonna assume that you guys are familiar with a lot of what I'm doing here 9 00:00:33,100 --> 00:00:36,790 from the foundations course, where we went through, pretty slowly, 10 00:00:36,790 --> 00:00:41,540 a lot of the GraphLab-related code that we're seeing here. 11 00:00:41,540 --> 00:00:45,324 And I wanna emphasize that throughout this course you'll actually learn how to 12 00:00:45,324 --> 00:00:48,768 implement these methods, but for the sake of this demo and other demos in 13 00:00:48,768 --> 00:00:52,326 this course, we're gonna just use GraphLab Create to keep the discussion 14 00:00:52,326 --> 00:00:55,746 at a much higher level about the concepts that we're trying to convey. 15 00:00:55,746 --> 00:01:00,713 Okay, so the data set that we're looking at here in this example is 16 00:01:00,713 --> 00:01:04,972 a Philadelphia housing data set, where in particular, 17 00:01:04,972 --> 00:01:08,698 our data set consists of the average house price in 18 00:01:08,698 --> 00:01:13,603 a whole collection of towns in the greater Philadelphia region. 19 00:01:13,603 --> 00:01:18,410 And we also have information about crime rates in each one of these towns, 20 00:01:18,410 --> 00:01:21,230 as well as how far that town is from Center City, and 21 00:01:21,230 --> 00:01:23,760 Center City is the downtown region of Philadelphia.
22 00:01:25,640 --> 00:01:27,960 So, let's start analyzing this data. 23 00:01:27,960 --> 00:01:34,214 And to do this, what we're going to start with is just making a scatter plot 24 00:01:34,214 --> 00:01:41,291 of the relationship between average house sales prices and crime rates. 25 00:01:43,234 --> 00:01:48,263 Okay, so here we are, we're gonna do just 26 00:01:48,263 --> 00:01:53,293 a .show command to show a scatter plot where, 27 00:01:53,293 --> 00:01:57,085 on the x axis, we have crime rate. 28 00:01:57,085 --> 00:02:00,409 And each one of these little blue circles, or cyan, 29 00:02:00,409 --> 00:02:04,060 light blue circles, is a different town in our dataset. 30 00:02:04,060 --> 00:02:07,270 And we have a total of 98 different towns. 31 00:02:07,270 --> 00:02:13,420 And on the y axis what we have is the average house value in that town. 32 00:02:13,420 --> 00:02:14,390 Okay. 33 00:02:14,390 --> 00:02:18,700 And so, from this you can see that there's some relationship between 34 00:02:18,700 --> 00:02:21,550 our crime rate and our house sales price. 35 00:02:21,550 --> 00:02:26,150 In particular, we see that towns that have lower crime rates 36 00:02:26,150 --> 00:02:29,400 tend to have higher house values and vice versa. 37 00:02:29,400 --> 00:02:30,970 So that makes a lot of sense. 38 00:02:30,970 --> 00:02:35,270 So let's try and actually fit a relationship between crime rate and 39 00:02:35,270 --> 00:02:36,370 house price. 40 00:02:36,370 --> 00:02:41,960 So we're gonna go through and fit this regression model using our standard 41 00:02:43,180 --> 00:02:48,720 dot linear regression command, taking our target, or 42 00:02:48,720 --> 00:02:53,470 our output, to be the house price in that region, and 43 00:02:53,470 --> 00:03:00,460 taking our features to just be a single feature, which is crime rate in that town.
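To make the fitting step above concrete, here is a minimal sketch of what a simple linear regression with a single feature computes. It does not use GraphLab Create; it is plain Python using the closed-form least squares solution. The function name `fit_simple_linear_regression` and the numbers in `crime_rate` and `house_price` are made up for illustration and are not the Philadelphia data set.

```python
# Minimal sketch: simple linear regression (one feature) via closed-form
# least squares. The data is synthetic and purely illustrative; it is NOT
# the Philadelphia housing data set used in the demo.

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered cross-product of x and y, and centered sum of squares of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical towns: feature = crime rate, target = average house price.
crime_rate = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
house_price = [305.0, 275.0, 262.0, 238.0, 224.0, 196.0]

intercept, slope = fit_simple_linear_regression(crime_rate, house_price)
print("intercept:", intercept, "slope:", slope)
```

With data like this, where price falls as crime rises, the fitted slope comes out negative, which matches the trend seen in the scatter plot.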
44 00:03:00,460 --> 00:03:06,420 Okay, so now what we've done is we've output this to something called 45 00:03:06,420 --> 00:03:13,335 crime_model, and now let's look at what this fit resulted in. 46 00:03:13,335 --> 00:03:19,270 So I just import our matplotlib 47 00:03:19,270 --> 00:03:23,740 library to start making some plots here, and 48 00:03:23,740 --> 00:03:27,631 what we're gonna show is a plot of 49 00:03:27,631 --> 00:03:32,260 the observations that we showed before as well as our fitted line. 50 00:03:32,260 --> 00:03:39,670 So our fitted simple linear regression model is this green line. 51 00:03:39,670 --> 00:03:45,370 So these are our predictions of house values for each crime 52 00:03:45,370 --> 00:03:50,590 rate going from 0 up to somewhere around, I don't know, 360 or something like this. 53 00:03:52,360 --> 00:03:57,250 So we do see a trend where house value 54 00:03:57,250 --> 00:04:02,120 decreases as crime increases; the slope of this line is negative. 55 00:04:02,120 --> 00:04:06,950 But one thing that we pretty immediately see is there's an observation out here, 56 00:04:06,950 --> 00:04:13,490 there's this blue dot which has an extremely high crime rate, but 57 00:04:13,490 --> 00:04:20,280 the house value is, I mean, it's low-ish, but it's not as 58 00:04:20,280 --> 00:04:25,370 low as the house values in other regions that have significantly lower crime rates. 59 00:04:25,370 --> 00:04:28,080 And we see that our line, our fitted line, 60 00:04:28,080 --> 00:04:32,495 is getting pulled towards this observation that's all the way out here. 61 00:04:32,495 --> 00:04:36,310 So it's being, at least it looks like from the picture, 62 00:04:36,310 --> 00:04:39,620 heavily influenced by this one observation. 63 00:04:39,620 --> 00:04:42,320 And that observation is really, really far out on the x axis.
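One way to check the visual impression that a single far-out point is pulling the line is to compute each observation's leverage and to refit without the suspect point. The sketch below assumes the standard leverage formula for simple linear regression, h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The data is synthetic (six ordinary towns plus one town with a crime rate near 360), and the helper names `fit_line` and `leverages_for` are hypothetical; none of this is the GraphLab Create API or the demo's actual data.

```python
# Minimal sketch: measure leverage and compare fits with and without a
# far-out observation. Synthetic data only; NOT the Philadelphia data set.

def fit_line(x, y):
    """Closed-form least squares fit; returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope

def leverages_for(x):
    """Leverage h_i = 1/n + (x_i - x_mean)^2 / S_xx for each observation."""
    n = len(x)
    x_mean = sum(x) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / s_xx for xi in x]

# Six ordinary hypothetical towns plus one with an extreme crime rate.
crime_rate = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 360.0]
house_price = [300.0, 280.0, 260.0, 240.0, 220.0, 200.0, 100.0]

leverages = leverages_for(crime_rate)
outlier_index = leverages.index(max(leverages))

# Fit once on all towns, then again with the highest-leverage town removed.
_, slope_with = fit_line(crime_rate, house_price)
x_rest = [v for i, v in enumerate(crime_rate) if i != outlier_index]
y_rest = [v for i, v in enumerate(house_price) if i != outlier_index]
_, slope_without = fit_line(x_rest, y_rest)

print("highest-leverage index:", outlier_index)
print("slope with all towns:", slope_with)
print("slope without that town:", slope_without)
```

In this made-up example the far-out town has by far the largest leverage, and dropping it makes the slope noticeably steeper. That is the kind of pull the picture suggests, though whether the same holds for the real data has to be checked on the real data.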