1 00:00:00,218 --> 00:00:03,910 [MUSIC] 2 00:00:03,910 --> 00:00:07,360 In this module, we talked about how to do regression part. 3 00:00:07,360 --> 00:00:09,374 We talked about how to use it to predict house prices. 4 00:00:09,374 --> 00:00:16,160 Now, we're going to build together and pricing notebook using Python to predict 5 00:00:16,160 --> 00:00:21,180 house prices for a real dataset, based on what's called King County data. 6 00:00:21,180 --> 00:00:26,104 King County is the county or the region where the city of Seattle, where Emmy and 7 00:00:26,104 --> 00:00:27,448 I live, is located. 8 00:00:27,448 --> 00:00:31,040 So, we're going to take some of the data, it's public record data, and 9 00:00:31,040 --> 00:00:35,400 actually build together a regression notebook to predict house prices. 10 00:00:35,400 --> 00:00:36,650 So let's get started. 11 00:00:36,650 --> 00:00:39,420 Okay, so here's a blank IPython notebook. 12 00:00:40,630 --> 00:00:46,440 And just to start, what I'm going to do is hide the, 13 00:00:46,440 --> 00:00:51,990 first let's change the title, so we're going to 14 00:00:51,990 --> 00:00:57,183 change the title to predicting house prices. 15 00:00:57,183 --> 00:01:04,080 So we just renamed it, and I like to the view menu and hide the header, 16 00:01:04,080 --> 00:01:08,290 and the toolbar, this part on the bottom, so we have more space on the slide. 17 00:01:08,290 --> 00:01:09,741 So we done the hiding. 18 00:01:09,741 --> 00:01:15,045 And the first thing we're going to do is fire up Graph Lab Create, 19 00:01:15,045 --> 00:01:20,453 the tool that we're going to use to run some algorithms in Python. 20 00:01:20,453 --> 00:01:24,639 So I'm going to type ask M, and 21 00:01:24,639 --> 00:01:30,442 I'm going to say fire up graphlab create. 22 00:01:30,442 --> 00:01:37,160 So we do that by typing import graphlab. 23 00:01:38,750 --> 00:01:41,550 And that just starts up graphlab create. 24 00:01:41,550 --> 00:01:46,150 Now our task today is to predict house prices. 25 00:01:46,150 --> 00:01:52,303 So the first thing that we are going to do is to load some 26 00:01:52,303 --> 00:01:58,630 house sales data. 27 00:01:58,630 --> 00:02:03,850 So this is public data, public record, of house that got sold in the Seattle region. 28 00:02:05,510 --> 00:02:10,715 So, I'm going to call this table data sales, 29 00:02:10,715 --> 00:02:15,420 and I'm going to say graphlab .SFrame. 30 00:02:15,420 --> 00:02:19,730 Remember we talked about SFrame as the data structure for 31 00:02:19,730 --> 00:02:22,670 representing tabular data in graphlab create? 32 00:02:22,670 --> 00:02:26,280 So it's a really fast hour of core data structure, and 33 00:02:26,280 --> 00:02:29,720 we're going to load up some house things in there. 34 00:02:29,720 --> 00:02:36,100 So this is going to be called home underscore data, 35 00:02:36,100 --> 00:02:41,560 and notice that data is now complete for you, so that just happened then. 36 00:02:41,560 --> 00:02:47,650 So if I just type while this is loading up and it's firing up GraphLab Create, 37 00:02:47,650 --> 00:02:52,350 if I just type sales here, you'll see what that data looks like. 38 00:02:52,350 --> 00:02:57,840 So I'm going to scroll up a little bit to the top, I just type sales and says, 39 00:02:57,840 --> 00:03:02,530 there was an ID for a date with the sale, the price, number of bedrooms, 40 00:03:02,530 --> 00:03:07,170 number of bathrooms, square feet, which is kind of like that American version of 41 00:03:07,170 --> 00:03:13,120 square meters if you're living in other countries, for the house, feet for 42 00:03:13,120 --> 00:03:19,430 the lot of land, number of floors, in a bunch of the categories. 43 00:03:19,430 --> 00:03:25,370 Whether the house has a view or not, whether it sits on a grade, 44 00:03:25,370 --> 00:03:31,050 which means it's on a hill, and a bunch of other measurements. 45 00:03:31,050 --> 00:03:34,580 We've loaded this house data and it looks pretty cool. 46 00:03:34,580 --> 00:03:38,042 The first thing that we're going to do is use graph lab canvas and 47 00:03:38,042 --> 00:03:39,910 do a little bit of visualization. 48 00:03:42,070 --> 00:03:48,861 So what I'm going to do is again, create the cell. 49 00:03:48,861 --> 00:03:57,429 And this says exploring the data for housing. 50 00:04:00,029 --> 00:04:01,252 So housing sales. 51 00:04:02,750 --> 00:04:04,760 So we're going to do some data exploration. 52 00:04:04,760 --> 00:04:09,886 So we're going to take the sales data and I'm going to show, so 53 00:04:09,886 --> 00:04:15,885 when I type .show, it's going to show some visualization of that data, 54 00:04:15,885 --> 00:04:19,947 and in particular, what I'm going to do is view, 55 00:04:19,947 --> 00:04:24,980 just rather than letting graphlab do this view, we're going 56 00:04:24,980 --> 00:04:31,040 to just do a scatter plot, so we'll see what a scatter plot is in a second. 57 00:04:31,040 --> 00:04:36,100 We'll just type new scatter plot that relates two variables. 58 00:04:36,100 --> 00:04:42,810 On the X axis, we're going to have the square feet of living space. 59 00:04:44,460 --> 00:04:47,520 And in the y axis, we're going to put the price. 60 00:04:48,960 --> 00:04:50,863 So, what that should show us, 61 00:04:50,863 --> 00:04:55,051 is the relationship to where square feet of living space and price. 62 00:04:55,051 --> 00:04:59,635 Now, one little trick that I like to do when I create notebooks 63 00:04:59,635 --> 00:05:04,305 is that sometimes you push out graph lab canvas in a new tub, but 64 00:05:04,305 --> 00:05:09,508 also it's kind of fun to just plot those scatter plots and simple plots 65 00:05:09,508 --> 00:05:15,364 inside the notebook itself, so it can print it off and hand it off to somebody. 66 00:05:15,364 --> 00:05:22,170 So the way to do that is I can just tell graph lab canvas to set its target, 67 00:05:22,170 --> 00:05:26,936 not to be the browser which is the default target, 68 00:05:26,936 --> 00:05:30,030 but to be the ipython notebook. 69 00:05:30,030 --> 00:05:35,360 I just type canvas.set_target('ipynb'), ipython notebook, and 70 00:05:35,360 --> 00:05:40,120 it's going to plot this scatter plot on the notebook itself. 71 00:05:40,120 --> 00:05:44,500 So if I hit enter here, what's going to do is take 72 00:05:46,140 --> 00:05:49,360 those two axes and just plot them together. 73 00:05:49,360 --> 00:05:49,990 So here we go. 74 00:05:49,990 --> 00:05:56,750 On the X axis is the square feet of the house, and the Y axis is the price. 75 00:05:56,750 --> 00:05:59,810 So let's kind of browse this a little bit. 76 00:06:01,590 --> 00:06:06,200 So for example, the more square feet they're big houses, so 77 00:06:06,200 --> 00:06:11,230 if I mouse over here you'll see that this big house had 5,990 square feet, 78 00:06:11,230 --> 00:06:16,810 which is pretty big, that's like 600 square meters, 79 00:06:16,810 --> 00:06:21,310 and it was sold for $2.2 million, which is quite a lot of money. 80 00:06:21,310 --> 00:06:25,210 Now, you also notice there's a nice relationship, 81 00:06:25,210 --> 00:06:28,140 that the bigger houses tend to cost more. 82 00:06:28,140 --> 00:06:30,000 There's a big blob of houses here, so 83 00:06:30,000 --> 00:06:33,790 most houses are between 1,000 and 3,000 square feet. 84 00:06:33,790 --> 00:06:39,280 And even at this level, see this house over here, it's an outlier. 85 00:06:39,280 --> 00:06:45,657 So even though it's only 1,910 square feet, it was sold for $1.5 million. 86 00:06:45,657 --> 00:06:50,917 Well, down here there's a house that is similar 1,700 square feet, 87 00:06:50,917 --> 00:06:53,682 that sold for just $149,000. 88 00:06:53,682 --> 00:06:55,890 This is a big discrepancy. 89 00:06:55,890 --> 00:06:59,010 And here's the biggest outlier of the dataset. 90 00:06:59,010 --> 00:07:04,179 This house has 3,730 square feet and got sold for $2.5 million. 91 00:07:07,385 --> 00:07:11,239 [MUSIC]