1 00:00:00,000 --> 00:00:03,449 [NOISE] In this example, 2 00:00:03,449 --> 00:00:08,110 we will see linear regression. 3 00:00:08,110 --> 00:00:12,034 But before we start, we need to define the multivariate and 4 00:00:12,034 --> 00:00:14,476 univariate normal distributions. 5 00:00:14,476 --> 00:00:19,632 The univariate normal distribution has the following probability density function. 6 00:00:19,632 --> 00:00:22,262 It has two parameters, mu and sigma squared. 7 00:00:22,262 --> 00:00:27,485 Mu is the mean of the random variable, and sigma squared is its variance. 8 00:00:27,485 --> 00:00:29,728 Its functional form is given as follows. 9 00:00:29,728 --> 00:00:34,367 It is some normalization constant that ensures that this probability 10 00:00:34,367 --> 00:00:39,256 density function integrates to 1, times the exponential of a parabola. 11 00:00:39,256 --> 00:00:43,808 The maximum value of this parabola is at the point mu. 12 00:00:43,808 --> 00:00:49,243 And so the mode of the distribution will also be the point mu. 13 00:00:49,243 --> 00:00:56,101 If we vary the parameter mu, we will get different probability densities. 14 00:00:56,101 --> 00:01:00,715 For example, for the green one, we'll have mu equal to -4, and for 15 00:01:00,715 --> 00:01:03,256 the red one, we'll have mu equal to 4. 16 00:01:03,256 --> 00:01:06,274 If we vary the parameter sigma squared, 17 00:01:06,274 --> 00:01:09,840 we will get either a sharp distribution or a wide one. 18 00:01:09,840 --> 00:01:14,400 The blue curve has the variance equal to 1, and 19 00:01:14,400 --> 00:01:18,774 the red one has the variance equal to 9. 20 00:01:18,774 --> 00:01:21,970 The multivariate case looks exactly the same. 21 00:01:21,970 --> 00:01:25,348 We have two parameters, mu and sigma. 22 00:01:25,348 --> 00:01:29,705 Mu is the mean vector, and sigma is the covariance matrix. 23 00:01:29,705 --> 00:01:34,363 We again have some normalization constant, to ensure that the probability 24 00:01:34,363 --> 00:01:39,102 density function integrates to 1, and some quadratic term under the exponent. 25 00:01:39,102 --> 00:01:44,295 Again, the maximum value of the probability density function is at mu, 26 00:01:44,295 --> 00:01:48,310 and so the mode of the distribution will also be equal to mu. 27 00:01:48,310 --> 00:01:51,982 In neural networks, for example, we have a lot of parameters. 28 00:01:51,982 --> 00:01:55,738 Let's denote the number of parameters as D. 29 00:01:55,738 --> 00:02:00,651 Then the sigma matrix has a lot of parameters, about D squared. 30 00:02:00,651 --> 00:02:07,077 Actually, since sigma is symmetric, we need D (D+1) / 2 parameters. 31 00:02:07,077 --> 00:02:11,639 It may be really costly to store such a matrix, so we can use an approximation. 32 00:02:11,639 --> 00:02:14,630 For example, we can use diagonal matrices. 33 00:02:14,630 --> 00:02:20,097 In this case, all elements that are not on the diagonal will be zero, 34 00:02:20,097 --> 00:02:23,316 and then we will have only D parameters. 35 00:02:23,316 --> 00:02:26,247 An even simpler case has only one parameter; 36 00:02:26,247 --> 00:02:29,118 it is called a spherical normal distribution. 37 00:02:29,118 --> 00:02:34,806 In this case, the sigma matrix equals some scalar times the identity matrix. 38 00:02:34,806 --> 00:02:37,305 Now let's talk about linear regression. 39 00:02:37,305 --> 00:02:41,502 In linear regression, we want to fit a straight line to the data. 40 00:02:41,502 --> 00:02:44,135 We fit it in the following way.
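[Editor's note: before moving on to the fit, here is a minimal sketch, not from the lecture, of the densities just described, assuming NumPy and SciPy are available; the specific means, variances, and covariance entries are illustrative.]

import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate densities like the curves on the slide: shifting mu moves the peak,
# growing sigma^2 widens the curve (variance 9 means standard deviation 3).
x = np.linspace(-10, 10, 201)
green = norm.pdf(x, loc=-4, scale=1)   # mu = -4, sigma^2 = 1
red = norm.pdf(x, loc=4, scale=1)      # mu =  4, sigma^2 = 1
wide = norm.pdf(x, loc=0, scale=3)     # mu =  0, sigma^2 = 9

# Multivariate case with D = 3: full, diagonal, and spherical covariances,
# with D(D+1)/2 = 6, D = 3, and 1 free parameters respectively.
D = 3
mu = np.zeros(D)
full_cov = np.array([[2.0, 0.3, 0.1],
                     [0.3, 1.5, 0.2],
                     [0.1, 0.2, 1.0]])
diag_cov = np.diag([2.0, 1.5, 1.0])
spherical_cov = 1.5 * np.eye(D)
point = np.ones(D)
for cov in (full_cov, diag_cov, spherical_cov):
    print(multivariate_normal.pdf(point, mean=mu, cov=cov))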
41 00:02:44,135 --> 00:02:47,919 We want to minimize the errors: 42 00:02:47,919 --> 00:02:53,376 the red line is the prediction and the blue points are the true values. 43 00:02:53,376 --> 00:02:57,695 And we want, somehow, to minimize those black lines, the errors. 44 00:02:57,695 --> 00:03:02,720 The line is usually found by solving the so-called least squares problem. 45 00:03:02,720 --> 00:03:06,485 Our straight line is parameterized by a weight vector w. 46 00:03:06,485 --> 00:03:11,568 The prediction for each point is computed as w transposed times xi, 47 00:03:11,568 --> 00:03:13,305 where xi is our point. 48 00:03:13,305 --> 00:03:15,971 Then we compute the sum of squares, that is, 49 00:03:15,971 --> 00:03:19,454 the squared difference between the prediction and the true value. 50 00:03:19,454 --> 00:03:24,705 And we try to find the vector w that minimizes this function. 51 00:03:24,705 --> 00:03:28,267 Let's see how this works from the Bayesian perspective. 52 00:03:28,267 --> 00:03:29,346 Here's our model. 53 00:03:29,346 --> 00:03:35,135 We have three random variables: the weights, the data, and the target. 54 00:03:35,135 --> 00:03:39,927 We're actually not interested in modeling the data, so we can write down the joint 55 00:03:39,927 --> 00:03:43,428 probability of the weights and the target, given the data. 56 00:03:43,428 --> 00:03:45,530 This is given by the following formula. 57 00:03:45,530 --> 00:03:49,216 It is the probability of the target given the weights and the data, times 58 00:03:49,216 --> 00:03:50,942 the probability of the weights. 59 00:03:50,942 --> 00:03:53,812 Now we need to define these two distributions. 60 00:03:53,812 --> 00:03:56,107 Let's assume them to be normal. 61 00:03:56,107 --> 00:03:59,292 The probability of the target given the weights and 62 00:03:59,292 --> 00:04:04,545 the data would be a Gaussian centered at the prediction, that is, w transposed X, 63 00:04:04,545 --> 00:04:09,100 with the covariance equal to sigma squared times the identity matrix. 64 00:04:09,100 --> 00:04:15,359 Finally, the probability of the weights would be a Gaussian centered around zero, 65 00:04:15,359 --> 00:04:20,310 with the covariance matrix gamma squared times the identity matrix. 66 00:04:22,079 --> 00:04:28,273 All right, so here are our formulas, and now let's train the linear regression. 67 00:04:28,273 --> 00:04:30,750 We'll do this in the following way. 68 00:04:30,750 --> 00:04:36,129 Let's compute the posterior probability over the weights, given the data. 69 00:04:36,129 --> 00:04:43,194 So this would be the probability of the parameters given 70 00:04:43,194 --> 00:04:47,375 the data, that is, y and X. 71 00:04:47,375 --> 00:04:51,158 So using the definition of 72 00:04:51,158 --> 00:04:56,265 conditional probability, 73 00:04:56,265 --> 00:05:00,802 we can write that it is P (y, 74 00:05:00,802 --> 00:05:04,407 w | X) / P (y | X). 75 00:05:04,407 --> 00:05:09,715 Let's try not to compute the full posterior distribution, but 76 00:05:09,715 --> 00:05:16,184 to compute the value at which this posterior distribution attains its maximum. 77 00:05:16,184 --> 00:05:22,855 So we'll try to maximize this with respect to the weights. 78 00:05:22,855 --> 00:05:28,050 We can notice that the denominator does not depend on the weights, 79 00:05:28,050 --> 00:05:33,172 and so we can maximize only the numerator, so we can cross it out. 80 00:05:33,172 --> 00:05:38,915 All right, so now we should 81 00:05:38,915 --> 00:05:44,419 maximize P (y, w | X).
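[Editor's note: a minimal sketch of the plain least squares fit described above, before the prior is added. The toy data, the bias column, and the noise level are made up for illustration.]

import numpy as np

# Hypothetical toy data: "blue points" scattered around a true line.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a bias column, so w = (slope, intercept).
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find w minimizing sum_i (w^T x_i - y_i)^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w            # the "black lines" between points and the fitted line
print("weights:", w, "sum of squares:", np.sum(residuals ** 2))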
82 00:05:44,419 --> 00:05:47,969 And this is actually given by our model. 83 00:05:47,969 --> 00:05:53,325 So we can plug in the formula: 84 00:05:53,325 --> 00:05:59,716 this would be P (y | X, w) P (w). 85 00:06:01,914 --> 00:06:08,426 And we want to maximize it with respect to the weights. 86 00:06:08,426 --> 00:06:11,770 All right, we can take the logarithm of this part, and 87 00:06:11,770 --> 00:06:16,589 since the logarithm is monotonically increasing, the position of the maximum will not change. 88 00:06:16,589 --> 00:06:21,475 So we can take the logarithm here, 89 00:06:21,475 --> 00:06:24,337 and the logarithm here. 90 00:06:27,128 --> 00:06:31,762 And so this will be equivalent to the previous problem. 91 00:06:31,762 --> 00:06:37,810 All right, now we can plug in the formulas and try to solve the optimization problem. 92 00:06:40,127 --> 00:06:45,171 So we have 93 00:06:45,171 --> 00:06:52,737 log P (y | X, 94 00:06:52,737 --> 00:06:57,782 w) + log 95 00:06:57,782 --> 00:07:02,831 P (w). 96 00:07:04,343 --> 00:07:08,007 We can plug in the formulas for the normal distribution and 97 00:07:08,007 --> 00:07:09,965 obtain the following result. 98 00:07:09,965 --> 00:07:15,772 The first term will be the log of some normalization 99 00:07:15,772 --> 00:07:21,038 constant C1 times exp(-1/2 ...). 100 00:07:21,038 --> 00:07:24,970 The mean is w transposed X, so 101 00:07:24,970 --> 00:07:29,632 inside we have (y - w transposed X) transposed, 102 00:07:29,632 --> 00:07:35,182 times the inverse of the covariance matrix, 103 00:07:35,182 --> 00:07:42,431 that is, (sigma squared I) inverse, 104 00:07:42,431 --> 00:07:48,626 and finally, (y - w transposed X). 105 00:07:48,626 --> 00:07:53,033 And we have to close all the brackets, right? 106 00:07:53,033 --> 00:07:59,474 In a similar way, we can write down the second term: 107 00:07:59,474 --> 00:08:04,545 this would be the log of C2 times exp(-1/2 ...), 108 00:08:04,545 --> 00:08:09,753 and inside we have w transposed (gamma squared I) 109 00:08:09,753 --> 00:08:14,982 inverse w, since the mean is 0. 110 00:08:14,982 --> 00:08:21,093 All right, so we can take the constants out of the logarithm, and 111 00:08:21,093 --> 00:08:26,997 also the logarithm of the exponential is just the identity function. 112 00:08:26,997 --> 00:08:33,846 So what we'll have left is minus one-half times each quadratic term. 113 00:08:33,846 --> 00:08:37,750 The inverse of the identity matrix is the identity matrix, 114 00:08:37,750 --> 00:08:42,198 and the inverse of sigma squared is one over sigma squared. 115 00:08:42,198 --> 00:08:45,489 So we'll have something like this: 116 00:08:45,489 --> 00:08:53,627 minus 1 over 2 sigma squared, times (y - w transposed X) transposed (y - w transposed X). 117 00:08:53,627 --> 00:08:58,808 And finally, we'll have a term minus 118 00:08:58,808 --> 00:09:04,531 1 / (2 gamma squared) times w transposed w. 119 00:09:04,531 --> 00:09:12,089 This thing is actually a norm, so we'll have the norm of w squared. 120 00:09:12,089 --> 00:09:15,735 This is the norm of w squared. 121 00:09:15,735 --> 00:09:20,841 And the other term is also a norm: the norm of (y - 122 00:09:20,841 --> 00:09:25,503 w transposed X) squared. 123 00:09:25,503 --> 00:09:32,214 So we try to maximize this thing with respect to w. 124 00:09:32,214 --> 00:09:36,655 We multiply it by -1 and 125 00:09:36,655 --> 00:09:42,705 also by 2 sigma squared. 126 00:09:42,705 --> 00:09:47,182 This turns the maximization problem into a minimization problem.
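[Editor's note: the derivation spoken above, written out as equations, with the vector of predictions written as Xw (spoken as "w transposed X") and "const" collecting the normalization constants.]

\begin{align*}
\log p(y \mid X, w) + \log p(w)
  &= -\tfrac{1}{2}\,(y - Xw)^\top (\sigma^2 I)^{-1} (y - Xw)
     \;-\; \tfrac{1}{2}\, w^\top (\gamma^2 I)^{-1} w \;+\; \text{const} \\
  &= -\frac{1}{2\sigma^2}\,\lVert y - Xw \rVert^2
     \;-\; \frac{1}{2\gamma^2}\,\lVert w \rVert^2 \;+\; \text{const}.
\end{align*}

Multiplying by $-2\sigma^2$ and dropping the constant turns the maximization over $w$ into
\[
  \min_w \; \lVert y - Xw \rVert^2 + \lambda\,\lVert w \rVert^2,
  \qquad \lambda = \frac{\sigma^2}{\gamma^2}.
\]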
127 00:09:47,182 --> 00:09:54,686 And finally, the formula would be the norm of (y - w transposed X) squared, 128 00:09:54,686 --> 00:09:59,912 plus some constant lambda, equal to sigma 129 00:09:59,912 --> 00:10:06,482 squared over gamma squared, times the norm of w squared. 130 00:10:06,482 --> 00:10:12,378 And since we multiplied by -1, it is a minimization problem. 131 00:10:12,378 --> 00:10:18,136 So actually, the first term is the sum of squares, 132 00:10:18,136 --> 00:10:21,803 as in the least squares problem. 133 00:10:21,803 --> 00:10:25,264 And the second term is an L2 regularizer. 134 00:10:25,264 --> 00:10:32,663 And so by adding a normal prior on the weights, 135 00:10:32,663 --> 00:10:37,977 we went from the least squares problem 136 00:10:37,977 --> 00:10:45,011 to L2-regularized linear regression. 137 00:10:45,011 --> 00:10:47,738 [SOUND] 138 00:10:47,738 --> 00:10:49,179 [MUSIC]
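[Editor's note: a minimal sketch of the resulting L2-regularized (ridge) regression under the Gaussian model above. The standard closed form for this objective is w = (X^T X + lambda I)^{-1} X^T y with lambda = sigma^2 / gamma^2; the variances and toy data below are illustrative.]

import numpy as np

def map_weights(X, y, sigma2=0.25, gamma2=1.0):
    # MAP estimate for the model above: ridge regression with lambda = sigma^2 / gamma^2.
    lam = sigma2 / gamma2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Toy usage with the same hypothetical data as in the earlier least squares sketch.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)
print(map_weights(X, y))   # slightly shrunk weights compared to plain least squares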