1 00:00:00,211 --> 00:00:04,087 [MUSIC] 2 00:00:04,087 --> 00:00:08,847 Okay, so that was just one example of how we can think about looking at features of 3 00:00:08,847 --> 00:00:12,080 a single input, but there are lots of other examples. 4 00:00:12,080 --> 00:00:17,630 And let's just go through one application in particular where this is very useful, 5 00:00:17,630 --> 00:00:20,400 and that's in detrending time series. 6 00:00:20,400 --> 00:00:23,750 So here what I'm showing is house sales, which are these gray dots, 7 00:00:23,750 --> 00:00:27,870 and there is a whole bunch of them, this is a real data set. 8 00:00:27,870 --> 00:00:31,610 So lots and lots of house sales over time. 9 00:00:31,610 --> 00:00:35,110 So instead of plotting house sales versus square feet, 10 00:00:35,110 --> 00:00:40,200 here we're looking at the trends in house value over time. 11 00:00:40,200 --> 00:00:44,470 And this plot in particular is for the Seattle metropolitan region and 12 00:00:44,470 --> 00:00:49,050 this is some data that you guys have been playing around with 13 00:00:49,050 --> 00:00:50,340 throughout this specialization. 14 00:00:51,730 --> 00:00:57,450 And what this black curve shows here is the average house value versus time. 15 00:00:58,880 --> 00:01:02,170 So just to be very specific, our observation or 16 00:01:02,170 --> 00:01:07,769 our output y sub i is the sales price of the ith house, and 17 00:01:07,769 --> 00:01:14,260 our input is going to be the time of that house sale. 18 00:01:14,260 --> 00:01:18,120 So we're going to denote that by t sub i for the ith house. 19 00:01:18,120 --> 00:01:21,367 And the time is recorded monthly because house 20 00:01:21,367 --> 00:01:25,336 sales are recorded monthly, at least in the US. 21 00:01:25,336 --> 00:01:29,790 And one thing that we see is that on average, 22 00:01:29,790 --> 00:01:34,160 the value of houses tends to increase with time. 
23 00:01:34,160 --> 00:01:37,660 So that's one effect that we probably want to capture. 24 00:01:37,660 --> 00:01:41,710 But then there's another more subtle effect that might be hard to see from this 25 00:01:41,710 --> 00:01:47,360 plot, but it's the fact that most houses are listed for 26 00:01:47,360 --> 00:01:52,040 sale in the summer, that's the common housing season in the US. 27 00:01:52,040 --> 00:01:53,860 And the good houses, they go really quickly. 28 00:01:55,190 --> 00:01:59,060 In contrast, in for example November, December, 29 00:01:59,060 --> 00:02:03,290 especially here in rainy Seattle, very few houses are listed for sale. 30 00:02:03,290 --> 00:02:07,150 Very few people are going out shopping for houses in the rain. 31 00:02:08,480 --> 00:02:12,460 So what ends up happening is any transactions that you see during these 32 00:02:12,460 --> 00:02:15,390 months are really from leftover inventory 33 00:02:15,390 --> 00:02:19,810 that was sitting around from the summer and just didn't sell in the summer because 34 00:02:19,810 --> 00:02:23,520 they weren't the best houses during that really competitive period. 35 00:02:23,520 --> 00:02:25,860 But if people are desperate and need to buy a house in November, 36 00:02:25,860 --> 00:02:28,190 December, they're left with whatever inventory is there. 37 00:02:29,850 --> 00:02:33,980 Or there's some other special circumstance for why that house sale is going on. 38 00:02:33,980 --> 00:02:37,650 And so the result of this is the fact that you tend to see 39 00:02:37,650 --> 00:02:43,190 higher prices in the summer and lower prices in some of the off months. 40 00:02:43,190 --> 00:02:47,340 And so what that means is there's what's called seasonality, okay. 41 00:02:47,340 --> 00:02:52,050 Seasonality is the effect where over some period of time. 42 00:02:52,050 --> 00:02:54,950 Which in this case is over the course of months. 
43 00:02:55,980 --> 00:03:00,630 We see an effect where there's a repeated pattern of prices increasing, 44 00:03:00,630 --> 00:03:02,670 decreasing, increasing, decreasing. 45 00:03:03,840 --> 00:03:05,550 So this is something that we'd like to model. 46 00:03:05,550 --> 00:03:09,440 And the way in which we're gonna model this is as follows. 47 00:03:09,440 --> 00:03:14,300 We're gonna assume that our ith house sale, the price of that house sale, 48 00:03:15,950 --> 00:03:19,130 is comprised of the following different components. 49 00:03:19,130 --> 00:03:23,900 There's one component which models just this increasing trend over time, and for 50 00:03:23,900 --> 00:03:29,570 the sake of this slide we're just assuming a very simple linear trend, so this part 51 00:03:29,570 --> 00:03:35,020 is just our simple linear regression model where our input is this time index, ti. 52 00:03:36,490 --> 00:03:40,670 Then, though, we're going to add some more features to this model, and 53 00:03:40,670 --> 00:03:44,310 the feature that we're going to add is this sinusoidal component, which is 54 00:03:44,310 --> 00:03:48,730 capturing the seasonality, this fluctuation of increased prices in the summer, 55 00:03:48,730 --> 00:03:50,960 decreased in the off seasons. 56 00:03:53,330 --> 00:03:57,890 And what we want is we want this sinusoid to reset every year 57 00:03:57,890 --> 00:04:02,060 because we see this pattern repeated again and again every year in our data. 58 00:04:02,060 --> 00:04:05,912 So we're going to choose the period of the sinusoid, and 59 00:04:05,912 --> 00:04:12,020 if you don't remember your trigonometry that's okay. 60 00:04:12,020 --> 00:04:13,850 Just look at this picture of the sinusoid. 61 00:04:13,850 --> 00:04:16,090 That will be sufficient for what I'm talking about. 62 00:04:16,090 --> 00:04:20,980 But the idea is that it resets every 12 months, and so 63 00:04:20,980 --> 00:04:24,230 that's captured by this two pi t over 12 term. 
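To make the "resets every 12 months" point concrete, here is a minimal sketch of the unshifted seasonal term. The function name `seasonal_wave` and the constant `PERIOD_MONTHS` are illustrative names, not something from the lecture; the only thing taken from the lecture is the two pi t over 12 argument with t measured in months.

```python
import math

PERIOD_MONTHS = 12  # the seasonal pattern is assumed to repeat once per year


def seasonal_wave(t_months: float) -> float:
    """Unshifted seasonal sinusoid sin(2*pi*t/12), with t in months."""
    return math.sin(2.0 * math.pi * t_months / PERIOD_MONTHS)


# The value at month t and at month t + 12 should agree up to rounding error.
for t in (1, 4, 7, 10):
    same = math.isclose(seasonal_wave(t), seasonal_wave(t + 12), abs_tol=1e-12)
    print(t, t + 12, same)
```

Any other period could be substituted for 12 if the repeating pattern were, say, quarterly rather than yearly.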
64 00:04:26,570 --> 00:04:30,860 And then the issue, though, is that in this case I've talked about 65 00:04:30,860 --> 00:04:33,510 the fact that prices tend to increase in the summer months and 66 00:04:33,510 --> 00:04:36,060 decrease in off season months, but 67 00:04:36,060 --> 00:04:40,250 in general you don't really know where that seasonality trend occurs. 68 00:04:40,250 --> 00:04:43,010 So there's some phase, some unknown phase to this process. 69 00:04:43,010 --> 00:04:45,530 We can think of that just as a shift, and 70 00:04:45,530 --> 00:04:49,350 that's represented by that phi parameter there. 71 00:04:49,350 --> 00:04:53,500 I've marked it in blue because that's the color of our parameters. 72 00:04:53,500 --> 00:04:57,560 And just to animate this here, what I'm saying is we don't know where 73 00:04:57,560 --> 00:05:00,820 this sinusoidal kind of trend is appearing in our data, 74 00:05:02,540 --> 00:05:06,760 whether the peaks occur in June or January or something like this. 75 00:05:06,760 --> 00:05:12,270 Okay, but now we have an issue because one of our parameters, 76 00:05:12,270 --> 00:05:17,670 this phi term, appears within this function of our input, ti. 77 00:05:19,470 --> 00:05:23,650 So, we haven't yet talked about this, but what this means is that, 78 00:05:23,650 --> 00:05:30,680 in this form, it's no longer just a simple linear regression where the parameters or 79 00:05:30,680 --> 00:05:36,220 weights of our model are just multiplying the inputs or the functions of the input. 80 00:05:36,220 --> 00:05:37,530 So it looks more complicated. 81 00:05:38,630 --> 00:05:41,140 But there's a nice trick we can do here, and 82 00:05:41,140 --> 00:05:43,970 that again is going back to some trigonometry. 83 00:05:43,970 --> 00:05:46,550 Specifically the following trigonometric identity. 
84 00:05:46,550 --> 00:05:52,500 Which says that if you take sine of a - b, then that's equivalent 85 00:05:52,500 --> 00:05:57,486 to sine(a)cosine(b) - cosine(a)sine(b). 86 00:05:57,486 --> 00:06:02,240 Hopefully I got that ordering correct. 87 00:06:02,240 --> 00:06:06,100 And if you apply that identity specifically to the case we have here, what 88 00:06:06,100 --> 00:06:08,750 you see is the following, where 89 00:06:08,750 --> 00:06:13,420 for this phi parameter, what we now have are two multiplicative terms. 90 00:06:13,420 --> 00:06:18,126 We have a cosine phi and a sine phi, or really a negative sine phi, 91 00:06:18,126 --> 00:06:23,690 multiplying the functions of our input ti, which are shown in orange. 92 00:06:23,690 --> 00:06:26,532 And so we can think of cosine of phi and 93 00:06:26,532 --> 00:06:30,670 minus sine phi as just some W parameters in our model. 94 00:06:33,040 --> 00:06:34,650 And that's summarized right here. 95 00:06:34,650 --> 00:06:39,700 So an equivalent way to represent the model that we had on this slide here is as 96 00:06:39,700 --> 00:06:43,920 follows, where we have again this linear term, and then this sinusoidal component 97 00:06:43,920 --> 00:06:50,030 we're breaking up into a sine and cosine term with these linear multipliers, 98 00:06:50,030 --> 00:06:54,430 W2 and W3, to account for this unknown shift or phase to this function. 99 00:06:56,000 --> 00:07:00,130 So, to make this very concrete, again we're in a featurized 100 00:07:00,130 --> 00:07:04,950 situation where the first feature of our model is just that constant feature. 101 00:07:04,950 --> 00:07:09,800 The second feature is just a linear feature, t itself, our input. 102 00:07:09,800 --> 00:07:13,340 But the third feature and fourth feature are these sine and 103 00:07:13,340 --> 00:07:16,990 cosine functions of our input, t. 104 00:07:18,170 --> 00:07:21,370 Okay, so let's apply this model to our housing data. 
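The step from the shifted sinusoid to the two linear weights can be written out as a short derivation. This is a sketch under stated assumptions: the amplitude symbol A and the noise term epsilon_i are notation introduced here for illustration and may not match the slide exactly; w_0 through w_3, t_i, and phi follow the description above.

```latex
\begin{align*}
A \sin\!\left(\frac{2\pi t_i}{12} - \phi\right)
  &= A\cos\phi \,\sin\!\left(\frac{2\pi t_i}{12}\right)
   - A\sin\phi \,\cos\!\left(\frac{2\pi t_i}{12}\right) \\
  &= w_2 \sin\!\left(\frac{2\pi t_i}{12}\right)
   + w_3 \cos\!\left(\frac{2\pi t_i}{12}\right),
  \qquad w_2 = A\cos\phi,\quad w_3 = -A\sin\phi, \\[4pt]
y_i &= w_0 + w_1 t_i
   + w_2 \sin\!\left(\frac{2\pi t_i}{12}\right)
   + w_3 \cos\!\left(\frac{2\pi t_i}{12}\right) + \epsilon_i .
\end{align*}
```

Under the same assumptions, the amplitude and phase can be read back from fitted weights as A = sqrt(w_2^2 + w_3^2) and phi = atan2(-w_3, w_2), so nothing is lost by fitting the linear form.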
105 00:07:21,370 --> 00:07:24,280 And here in this plot we've done just that, we're fitting a polynomial 106 00:07:24,280 --> 00:07:29,390 trend to capture this increase in prices over time as well as the sinusoidal 107 00:07:29,390 --> 00:07:34,780 seasonal component to capture these fluctuations of prices with the season. 108 00:07:36,200 --> 00:07:41,530 And so that's what's shown in this dark blue line here and in particular 109 00:07:41,530 --> 00:07:45,090 in this case, instead of a simple linear trend, we fit a 5th order polynomial. 110 00:07:45,090 --> 00:07:48,740 That's why we get a little bit more interesting of a shape over time. 111 00:07:48,740 --> 00:07:52,130 And to see the effect of this sine cosine basis, 112 00:07:52,130 --> 00:07:55,396 these features, let's zoom into this plot. 113 00:07:55,396 --> 00:07:59,400 So now I've just zoomed in on a little chunk of this data and 114 00:07:59,400 --> 00:08:04,551 you really see that sine and cosine having an effect with prices going up and 115 00:08:04,551 --> 00:08:08,921 down over the course of seasons across these different years. 116 00:08:08,921 --> 00:08:12,979 [MUSIC]
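The fit described above can be sketched with ordinary least squares on a feature matrix made of polynomial trend columns plus the sine and cosine columns. This is a minimal illustration, not the code used to produce the lecture's plot: the function names, the rescaling of t for the polynomial columns (added only for numerical stability), and the synthetic noise-free data are all assumptions made here. Setting `degree=1` gives the simple linear trend from the slide; `degree=5` corresponds to the 5th order polynomial mentioned for the plot.

```python
import numpy as np

PERIOD_MONTHS = 12.0  # seasonal pattern assumed to repeat yearly


def trend_season_features(t, degree=1):
    """Columns: 1, s, ..., s**degree, sin(2*pi*t/12), cos(2*pi*t/12).

    s is t rescaled to zero mean and unit spread. The rescaling is an
    assumption added here for numerical stability, not part of the lecture.
    """
    t = np.asarray(t, dtype=float)
    s = (t - t.mean()) / t.std()
    poly = np.vander(s, degree + 1, increasing=True)
    angle = 2.0 * np.pi * t / PERIOD_MONTHS
    return np.column_stack([poly, np.sin(angle), np.cos(angle)])


def fit_least_squares(features, y):
    """Ordinary least squares weights for the featurized model."""
    weights, *_ = np.linalg.lstsq(features, y, rcond=None)
    return weights


def amplitude_and_phase(w_sin, w_cos):
    """Invert w_sin = A*cos(phi), w_cos = -A*sin(phi)."""
    return np.hypot(w_sin, w_cos), np.arctan2(-w_cos, w_sin)


# Synthetic, noise-free monthly data, made up purely for illustration.
t = np.arange(120)
s = (t - t.mean()) / t.std()
true_amplitude, true_phase = 12.0, 0.5
y = (400.0 + 30.0 * s - 10.0 * s**3
     + true_amplitude * np.sin(2.0 * np.pi * t / PERIOD_MONTHS - true_phase))

features = trend_season_features(t, degree=5)
w = fit_least_squares(features, y)
amplitude, phase = amplitude_and_phase(w[-2], w[-1])
print("recovered amplitude and phase:", amplitude, phase)
```

Because the unknown phase has been folded into two linear weights, the whole model is fit in one linear least squares solve, and the phase is only computed afterwards from the sine and cosine weights.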