In this video, we'll look at several tricks that are very useful for training Gaussian processes. The first one is what you should do when you have noisy observations. As you remember, the mean goes exactly through the data points, and the variance is zero at the data points. So if you fit a Gaussian process to a noisy signal like this, you get a quickly changing function. But you can see that there is really something like a parabola here, plus a noise component.

So let's modify our model so that it has some notion of the noise in the data. The simplest way to do this is to add independent Gaussian noise to all random variables. We introduce a new random process f̂(x), equal to the original process f(x) plus independent Gaussian noise: f̂(x) = f(x) + ε, where ε has mean 0 and variance s² and is sampled independently for each point x of our space R^d. In this case the mean of the new process is still 0, since it is the sum of the means of f(x) and ε, which are both 0. The covariance changes in the following way: the new covariance is the old covariance K(x_i − x_j) plus s² times an indicator that the points x_i and x_j coincide. This happens because there is no covariance between noise samples at different positions.

If we fit the model using this kernel, we get the following result. As you can see, we no longer have zero variance at the data points, and the mean function became a bit smoother. However, this still isn't the best we can get: we can also tune the parameters of the kernel and find the values that are optimal for this particular dataset.
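As a minimal sketch of what this noisy-observation model looks like in code, here is a small NumPy implementation of GP regression with an RBF kernel where the noise variance s² is added only to the diagonal of the training covariance. The kernel parameters, the parabola-plus-noise toy data, and the function names are my own illustration, not taken from the video.

```python
import numpy as np

def rbf_kernel(a, b, sigma2=1.0, length=1.0):
    """Squared-exponential kernel: sigma^2 * exp(-(x - x')^2 / (2 l^2))."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return sigma2 * np.exp(-0.5 * sq_dists / length ** 2)

def gp_posterior(x_train, y_train, x_test, sigma2=1.0, length=1.0, noise2=0.1):
    """Posterior mean and variance of a zero-mean GP with additive Gaussian noise.

    The noise variance noise2 (s^2) enters only on the diagonal of the training
    covariance, which is exactly K(xi - xj) + s^2 * [xi == xj] from the video.
    """
    K = rbf_kernel(x_train, x_train, sigma2, length) + noise2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, sigma2, length)
    K_ss = rbf_kernel(x_test, x_test, sigma2, length)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

# Toy data: a parabola plus noise, similar in spirit to the video's example.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x ** 2 + rng.normal(0.0, 0.8, size=x.shape)
mean, var = gp_posterior(x, y, np.linspace(-3, 3, 100),
                         sigma2=10.0, length=2.0, noise2=0.7)
```

With the noise term in place, the posterior variance at the training inputs no longer collapses to zero, which is exactly the behaviour described above.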
If, for example, we set the length scale to 0.01, the covariance drops to zero very quickly as we move away from the data points, and the prediction looks like complete garbage. If we take the length scale to be 10, it is too high: the prediction changes really slowly, it is almost zero everywhere, and the variance is basically the variance of the prior process. So here we select l to be 2, somewhere in the middle, and we get a process like this. It still has some drawbacks: as you can see, around the positions −3 and 3 the process starts to revert its prediction towards 0. So maybe we could adjust the other parameters a bit as well, like sigma squared or s squared, and fit the Gaussian process even better. It turns out that we can do this automatically.

For the Gaussian kernel we have three parameters: sigma squared, the length scale l, and the noise variance s squared. We are going to tune them by maximizing the likelihood. We take our data points f(x_1), f(x_2), ..., f(x_n) and maximize the probability of observing this data given the parameters. Since everything is Gaussian for a Gaussian process, this probability is also Gaussian, with mean 0 and covariance matrix C, as we have seen in the previous video. If you write down carefully what this probability is equal to, you will see that you can optimize it using simple gradient ascent. Using this, you can automatically find the optimal values of the variance sigma squared, the noise variance s squared, and the length scale l.

If you run this procedure, you get something like this. We estimated l to be 2, which is indeed the right value; before, we spent some time finding it by hand, and here it was selected automatically. We were also able to estimate that the variance of the process should be 46.4 and the variance of the noise should be 0.7. As you can see, the prediction on the boundaries also became a bit better: the process doesn't revert towards zero so quickly.
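As a rough sketch of this tuning step, assuming a zero-mean GP with an RBF kernel and parameters (sigma², l, s²), the negative log marginal likelihood can be minimized with SciPy's L-BFGS optimizer and numerical gradients (a stand-in for the plain gradient ascent mentioned in the video). The toy data and names are hypothetical, so the estimates will differ from the video's 46.4, 2, and 0.7.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """-log p(y | X, sigma^2, l, s^2) for a zero-mean GP with an RBF kernel plus noise.

    Parameters are optimized in log space so that they stay positive.
    """
    sigma2, length, noise2 = np.exp(log_params)
    sq_dists = (x[:, None] - x[None, :]) ** 2
    C = sigma2 * np.exp(-0.5 * sq_dists / length ** 2) + noise2 * np.eye(len(x))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * y^T C^{-1} y + 0.5 * log|C| + constant
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

# Toy noisy parabola data.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = x ** 2 + rng.normal(0.0, 0.8, size=x.shape)

result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y),
                  method="L-BFGS-B")
sigma2_hat, length_hat, noise2_hat = np.exp(result.x)
print(sigma2_hat, length_hat, noise2_hat)
```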
Let's see how fitting this process works for different kinds of data. First, I tried to fit the Gaussian process to pure noise. In this case it estimated the noise variance s squared to be 0.79: it really believes that all the data I gave it is just noise. If I fit a Gaussian process to data that I sampled without noise, it quickly understands this and sets the noise variance to almost 0; in this case it was about 5 times 10 to the power of −17, which is really close to 0. If, however, the data has some signal but also some noise, it automatically finds a noise variance somewhere in between; in this case it was estimated to be 0.13.

All right, now let's see how Gaussian processes can be applied to classification problems. Previously we saw how they can be used for regression; classification is a bit harder. We have two possible labels, +1 and −1. We can use a latent process f(x), which expresses something like how confident we are in predicting one label or the other. If we somehow fit the latent process f(x), we can make predictions by passing it through a sigmoid function: the probability of the label y given f is simply 1 / (1 + exp(−y·f)), which is the sigmoid of the product y times f.

So to train this model, you first have to estimate the latent process: the distribution of the latent process at arbitrary points, given the labels that we already know. For example, y(x_1) could be +1 and y(x_2) could be −1; those are just binary observations. Once we have estimated the latent process, we can use it to compute predictions. We do this by marginalizing the joint probability of the labels and the latent process: the prediction is the integral of the probability of the label given the latent process, times the probability of the latent process, taken over all possible values of the latent process. The math here is a bit complex, so I'll skip it for now; let's just see how the prediction works.
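To make that marginalization concrete, here is one simple way to approximate the predictive probability, assuming we already have a Gaussian posterior over the latent value f at a test point (the mean and variance below are made-up numbers, and the Monte Carlo average is my substitution for the exact integral that the video skips).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(latent_mean, latent_var, n_samples=10_000, seed=0):
    """Approximate p(y = +1 | x*) = integral of sigmoid(f) * N(f | mean, var) df.

    The integral over the latent value f is replaced by a Monte Carlo average;
    the Gaussian posterior over f is assumed to come from the (skipped)
    latent-process estimation step.
    """
    rng = np.random.default_rng(seed)
    f_samples = rng.normal(latent_mean, np.sqrt(latent_var), size=n_samples)
    return sigmoid(f_samples).mean()

# A confidently positive region vs. a boundary region between the two classes.
print(predictive_prob(latent_mean=2.5, latent_var=0.2))  # close to 1
print(predictive_prob(latent_mean=0.0, latent_var=1.0))  # about 0.5
```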
So the first step, as I said, is the estimation of the latent process. In this case I have the labeled points marked as crosses here; some have the value +1, some have the value −1. If we fit the latent process, it looks like this. As you can see, as we move into an area where all points have the label +1, the values of the latent process become positive, and around the negative examples the values of the process become negative. And here are the predictions: I just took the latent process and passed it through the sigmoid. As you can see, the prediction is almost one in the positions where there are many positive points nearby, and the same happens for the negative examples. In the regions where the targets change from +1 to −1, the variance is high and the prediction is not so certain. For example, somewhere between −1 and −2 the value of the prediction is around 0.5: the model is almost completely unsure about its prediction there.

One last thing I want to tell you about is inducing inputs. Training a Gaussian process turns out to be quite computationally expensive: if you have n points, computing the prediction costs on the order of n cubed, since you have to invert the covariance matrix. There is one simple idea, called inducing inputs, to speed up the Gaussian process. What you could do is replace the dataset with a small number of points, a bit like the support vectors in an SVM, and then fit the Gaussian process using only those points. If you select m inducing points, the precomputation costs on the order of m squared times n, which is quite fast when m is small. Computing the mean at a new point costs on the order of m, which is almost instant, and predicting the variance at each point costs on the order of m squared. You can also optimize the positions of the inducing points, and the values of the process at them, using maximum likelihood estimation.

So let's see how it works. Here I have 100 points and I fitted a Gaussian process to them. Here, however, I selected 10 inducing points placed uniformly and fitted the Gaussian process using them. As you can see, the values of the Gaussian process didn't change much, but the cost of the prediction is much lower, since we have 10 times fewer points than before.
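As a sketch of what such a sparse approximation can look like, here is the standard subset-of-regressors / DTC construction, which matches the O(m²·n) precomputation and the O(m) mean and O(m²) variance costs per test point quoted above. The video does not specify which approximation it uses, and the kernel parameters and data here are illustrative.

```python
import numpy as np

def rbf(a, b, sigma2=1.0, length=1.0):
    return sigma2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def sparse_gp_predict(x_train, y_train, z_inducing, x_test,
                      sigma2=1.0, length=1.0, noise2=0.1):
    """Predictions with m inducing inputs (subset-of-regressors / DTC style).

    Building Sigma is O(m^2 n); each test mean is O(m) and each variance O(m^2),
    matching the costs quoted in the video.
    """
    m = len(z_inducing)
    K_mm = rbf(z_inducing, z_inducing, sigma2, length) + 1e-6 * np.eye(m)
    K_mn = rbf(z_inducing, x_train, sigma2, length)       # m x n
    K_tm = rbf(x_test, z_inducing, sigma2, length)        # t x m
    Sigma = np.linalg.inv(K_mm + K_mn @ K_mn.T / noise2)  # m x m
    mean = K_tm @ (Sigma @ (K_mn @ y_train)) / noise2
    q_diag = np.einsum("ij,jk,ik->i", K_tm, np.linalg.inv(K_mm), K_tm)
    var = sigma2 - q_diag + np.einsum("ij,jk,ik->i", K_tm, Sigma, K_tm)
    return mean, var

# 100 noisy training points, 10 inducing inputs placed uniformly, as in the video.
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(0.0, 0.8, size=x.shape)
z = np.linspace(-3, 3, 10)
mean, var = sparse_gp_predict(x, y, z, np.linspace(-3, 3, 200),
                              sigma2=10.0, length=2.0, noise2=0.7)
```

In a full sparse GP the inducing locations z would themselves be optimized together with the kernel parameters by maximizing the (approximate) marginal likelihood, as mentioned above; here they are simply placed on a uniform grid for brevity.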