Now that we know what Gaussian processes are, let's see how they can be applied to machine learning. So you have some points x1, x2, and so on up to xn, and you know the values of the function at them: f(x1), f(x2), ..., f(xn). Normally what you'd like to do is predict the value of the function at a new point x. With Gaussian processes we will go in a different direction: we will try to predict the full posterior over f(x) given all our data points. That is, we would like to estimate the probability of f(x), the prediction at the new point, given all the previous points. This will allow us to compute, for example, the mean, and also the confidence interval at each point, so we can estimate the uncertainty of our predictions.

So how do we predict this? Here is our desired value: the probability of the prediction given our points. It equals the ratio between the joint probability over all points, including the new one, and the joint probability over our data points. Both of those are normal: the one in the numerator is normal with mean zero and covariance matrix C tilde, and in the denominator we have a normal with mean zero and covariance matrix C.
So the covariance matrix C has the following form. On the diagonal we have k(0); this is the variance of the random process. On the off-diagonal elements we have the kernel function evaluated at the difference between the two corresponding points. C tilde looks like this: it is a matrix with four blocks. The first block is k(0), that is, the covariance of f(x) with itself. In the lower right part we have the covariance matrix C, corresponding to the covariance of the data points, and in the remaining blocks we have the covariance between the new point and the data points we had before. This is the vector k: its first element is k(x - x1), the kernel of the new point minus the old point x1; in the second position we have k(x - x2), and so on.

So finally, we have a normal distribution for f(x) with some mean mu and variance sigma squared. Do you remember why the posterior is a normal distribution? Well, this happens because the ratio of two Gaussians is also a Gaussian: each Gaussian has a parabola in the exponent, and when we divide two Gaussians we get the difference of two parabolas, which is again a parabola.
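As a concrete illustration, here is a minimal NumPy sketch of building C, the vector k, and the block matrix C tilde. The squared-exponential kernel, the length scale, and the sample points are all illustrative assumptions, not taken from the lecture:

```python
import numpy as np

def kernel(d, length_scale=1.0):
    # Hypothetical stationary kernel: squared exponential, k(d) = exp(-d^2 / (2 l^2)).
    d = np.asarray(d, dtype=float)
    return np.exp(-d**2 / (2 * length_scale**2))

x_train = np.array([0.0, 1.0, 2.5])   # observed points x1..xn (illustrative)
x_new = 1.7                           # new point x

# C[i, j] = k(x_i - x_j); every diagonal entry is k(0), the process variance.
C = kernel(x_train[:, None] - x_train[None, :])

# k: covariance between the new point and each old point, k(x - xi).
k_vec = kernel(x_new - x_train)

# C tilde has four blocks: k(0) in the top-left corner, C in the lower right,
# and k / k^T as the off-diagonal blocks.
C_tilde = np.block([
    [kernel(0.0).reshape(1, 1), k_vec[None, :]],
    [k_vec[:, None],            C],
])
```

Note that C tilde is just the covariance matrix of the extended point set (x, x1, ..., xn), which is why it is symmetric and contains C as a sub-block.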
So the posterior will again be normal, with a certain mean and variance. One can derive the formulas for them, and they look as follows. The mean is k transposed C inverse f, that is, mu = k^T C^{-1} f, where f is the vector of the observed values, (f(x1), f(x2), ...). The variance is sigma^2 = k(0) - k^T C^{-1} k: the initial variance k(0) minus a term that shows us how much the variance decreased after we observed the data points. So this is what the posterior distribution looks like. Notice that the variance at the data points is zero: since we observed their values, we can say for sure that the value there is just f(x1), for example. As we move away from the points, the variance starts increasing, and when we are really far from the data points the process is simply stationary again: the mean is zero and the variance equals the initial variance k(0). So we should actually preprocess our data to make it stationary, to match this prior. We want our predictions far away from the data points to be stationary: we expect the mean to be zero and the variance to be k(0).
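The two formulas above, mu = k^T C^{-1} f and sigma^2 = k(0) - k^T C^{-1} k, can be sketched in a few lines of NumPy. The kernel and the sample data are illustrative assumptions; the check at the end reproduces the behavior described in the lecture (zero variance at an observed point, reversion to the prior far away):

```python
import numpy as np

def kernel(d, length_scale=1.0):
    # Hypothetical squared-exponential kernel; k(0) = 1 is the prior variance.
    d = np.asarray(d, dtype=float)
    return np.exp(-d**2 / (2 * length_scale**2))

def gp_posterior(x_new, x_train, f_train):
    """mu = k^T C^{-1} f,  sigma^2 = k(0) - k^T C^{-1} k."""
    C = kernel(x_train[:, None] - x_train[None, :])
    k_vec = kernel(x_new - x_train)
    mu = k_vec @ np.linalg.solve(C, f_train)        # k^T C^{-1} f
    var = kernel(0.0) - k_vec @ np.linalg.solve(C, k_vec)
    return float(mu), float(var)

x_train = np.array([0.0, 1.0, 2.5])   # illustrative data
f_train = np.array([0.3, -0.2, 0.5])

# At an observed point the posterior variance collapses to (numerically) zero
# and the mean reproduces the observed value.
mu0, var0 = gp_posterior(x_train[0], x_train, f_train)

# Far from the data the process reverts to the stationary prior: mean 0, variance k(0).
mu_far, var_far = gp_posterior(100.0, x_train, f_train)
```

Using `np.linalg.solve` instead of explicitly inverting C is the standard numerically stable way to compute C^{-1} f and C^{-1} k.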
To make this true, we should remove the trend and seasonality, and also subtract the mean and normalize. After training, the Gaussian process gives predictions on this transformed scale, so you should also remember to invert all those transformations when you predict at a new point. That is, you have to denormalize, add the mean back, add the trend and seasonality, and then output your prediction.
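Here is a minimal sketch of that preprocess-then-invert round trip, assuming only a linear trend and no seasonal component; all names and data are illustrative:

```python
import numpy as np

y = np.array([2.0, 3.1, 4.2, 5.0, 6.1])   # raw training targets (illustrative)
t = np.arange(len(y), dtype=float)

# 1. Remove the trend (a linear fit here; seasonality would be subtracted similarly).
slope, intercept = np.polyfit(t, y, deg=1)
detrended = y - (slope * t + intercept)

# 2. Subtract the mean and normalize to unit scale.
mean, scale = detrended.mean(), detrended.std() + 1e-12
z = (detrended - mean) / scale            # the Gaussian process is trained on z

# 3. When the GP predicts z_new at time t_new, invert every step in reverse order:
def postprocess(z_new, t_new):
    # denormalize, add the mean back, then add the trend back
    return (z_new * scale + mean) + (slope * t_new + intercept)
```

Applying `postprocess` to the training targets themselves recovers the original y, which is a quick sanity check that no transformation step was forgotten.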