[MUSIC] In this module,
we talked about how to do regression. We talked about how to use
it to predict house prices. Now, we're going to build together an
IPython notebook using Python to predict house prices for a real dataset,
based on what's called King County data. King County is the county or the region
where the city of Seattle, where Emily and I live, is located. So, we're going to take some of the data,
it's public record data, and actually build together a regression
notebook to predict house prices. So let's get started. Okay, so here's a blank IPython notebook. And just to start,
what I'm going to do is hide the, first let's change the title,
so we're going to change the title to
predicting house prices. So we just renamed it, and I like to go to
the View menu and hide the header and the toolbar, this part on the bottom,
so we have more space on the slide. So we've done the hiding. And the first thing we're going to
do is fire up GraphLab Create, the tool that we're going to use
to run some algorithms in Python. So I'm going to type Esc M, and I'm going to say fire up GraphLab Create. So we do that by typing import graphlab. And that just starts up GraphLab Create. Now our task today is to
predict house prices. So the first thing that we
are going to do is to load some house sales data. So this is public data, public record, of
houses that got sold in the Seattle region. So, I'm going to call
this table of data sales, and I'm going to say graphlab.SFrame. Remember we talked about SFrame
as the data structure for representing tabular
data in GraphLab Create? So it's a really fast
out-of-core data structure, and we're going to load up some
house things in there. So this is going to be
called home_data, and notice that the name is now auto-completed for
you, so that just happened then. So if I just type while this is loading
up and it's firing up GraphLab Create, if I just type sales here,
you'll see what that data looks like. So I'm going to scroll up a little bit
to the top, I just type sales and it says there's an ID, a date of the sale,
the price, number of bedrooms, number of bathrooms, square feet, which
is kind of like the American version of square meters if you're living in other
countries, for the house, square feet for the lot of land, number of floors,
and a bunch of other categories. Whether the house has a view or
not, whether it sits on a grade, which means it's on a hill, and
a bunch of other measurements. We've loaded this house data and
it looks pretty cool. The first thing that we're going
to do is use GraphLab Canvas and do a little bit of visualization. So what I'm going to do is again,
create a cell. And this says exploring the data for
housing. So housing sales. So we're going to do
some data exploration. So we're going to take the sales data and
I'm going to show, so when I type .show, it's going to show
some visualization of that data, and in particular,
what I'm going to do is set the view, so rather than letting GraphLab
choose the view, we're going to just do a scatter plot, so we'll see
what a scatter plot is in a second. We'll just type view equals scatter plot
that relates two variables. On the X axis, we're going to have
the square feet of living space. And in the y axis,
we're going to put the price. So, what that should show us, is the relationship to where square
feet of living space and price. Now, one little trick that I like
to do when I create notebooks is that sometimes you push out
GraphLab Canvas in a new tab, but also it's kind of fun to just plot
those scatter plots and simple plots inside the notebook itself, so you can
print it off and hand it off to somebody. So the way to do that is I can just
tell GraphLab Canvas to set its target, not to be the browser which
is the default target, but to be the IPython notebook. I just type graphlab.canvas.set_target('ipynb'),
IPython notebook, and it's going to plot this scatter
plot on the notebook itself. So if I hit enter here,
what it's going to do is take those two axes and
just plot them together. So here we go. On the X axis is the square feet of
the house, and the Y axis is the price. So let's kind of browse this a little bit. So for example, the more square
feet, the bigger the house, so if I mouse over here you'll see that
this big house had 5,990 square feet, which is pretty big,
that's like 600 square meters, and it was sold for $2.2 million,
which is quite a lot of money. Now, you also notice there's
a nice relationship, that the bigger houses tend to cost more. There's a big blob of houses here, so most houses are between 1,000 and
3,000 square feet. And even at this level,
see this house over here, it's an outlier. So even though it's only 1,910 square
feet, it was sold for $1.5 million. While down here there's a house
of similar size, 1,700 square feet, that sold for just $149,000. This is a big discrepancy. And here's the biggest
outlier of the dataset. This house has 3,730 square feet and
got sold for $2.5 million. [MUSIC]
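
To make the discrepancies called out on the scatter plot concrete, here is a minimal sketch in plain Python that computes the price per square foot for the four houses mentioned in the narration. The figures are the ones read off the plot above; the labels, the helper names (`price_per_sqft`, `summarize`), and the use of plain tuples instead of an SFrame are assumptions made for illustration. The commented lines at the top are a best-effort reconstruction of the notebook cells described in the video; the exact path of the `home_data` file is not stated in the narration.

```python
# Notebook cells described in the narration, reconstructed as comments only
# (GraphLab Create; the exact 'home_data' path/extension is an assumption):
#   import graphlab
#   sales = graphlab.SFrame('home_data')
#   graphlab.canvas.set_target('ipynb')
#   sales.show(view="Scatter Plot", x="sqft_living", y="price")

SQFT_TO_SQM = 0.09290304  # exact conversion factor: (0.3048 m) squared

# (label, living area in sq ft, sale price in USD), as quoted in the narration.
# The labels are hypothetical names added for this sketch.
houses = [
    ("large house", 5990, 2200000),
    ("small expensive house", 1910, 1500000),
    ("small cheap house", 1700, 149000),
    ("biggest outlier", 3730, 2500000),
]


def price_per_sqft(sqft, price):
    """Return the sale price divided by the living area in square feet."""
    return price / float(sqft)


def summarize(rows):
    """Return (label, area in square meters, price per sq ft) for each house."""
    summary = []
    for label, sqft, price in rows:
        area_sqm = round(sqft * SQFT_TO_SQM, 1)
        ppsf = round(price_per_sqft(sqft, price), 2)
        summary.append((label, area_sqm, ppsf))
    return summary


for label, area_sqm, ppsf in summarize(houses):
    print("%-22s %6.1f m^2   $%7.2f per sq ft" % (label, area_sqm, ppsf))
```

Comparing the two houses of similar size this way shows the gap directly: the 1,910 square foot house sold for roughly nine times as much per square foot as the 1,700 square foot one, which is why it stands out from the main blob of points.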