So we've seen how Gaussian processes can be used for regression and classification problems. Now let's see how they can be used for optimization problems.

So, what is black-box optimization? You have some function f(x), you don't know anything about it, and you want to find the maximum of this function. In the case when the gradient is known, you can use gradient descent with restarts. When the gradient is unknown, you can, for example, estimate it numerically, or, if the function is not differentiable, you can use grid search or random search.

However, there is one special class of optimization problems where evaluating the function at each point is really expensive. For example, you want to find out where the oil is, and you have to do some probe drilling to estimate the amount of oil at various points. So x in this case would be geographic coordinates, and f(x) would be, for example, the amount of oil. Here the cost of one sample would be something like $1 million.

If you would like to tune the hyperparameters of a neural network, then x would be the various hyperparameters, for example the size of the layers, the number of layers, and so on, and f(x) would be some objective function. In this case the cost of one sample would be 10 hours, since you would have to wait that long for the network to converge.

In drug discovery it is even more important. There x could be a drug, and f(x) would be its effect against some severe disease. In this case one sample, that is, one test of the proposed drug, would cost you two months of clinical trials, $10,000, for example, to pay the assistants, and maybe also the life of a rat, if you do experiments on rats.

So these are the cases when evaluating the function at any point is really, really expensive. And we will see how Gaussian processes can be used to find the maximum of the function while minimizing the number of samples you have to make. So again, our goal is to optimize a function, that is, to find its maximum, and also to minimize the number of trials.
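To make the baseline concrete, here is a minimal random-search sketch in Python. The objective expensive_f, its search range, and the budget are all hypothetical stand-ins for illustration; in a real problem, each call to expensive_f is the part that costs hours or dollars.

```python
import numpy as np

# Hypothetical stand-in for an expensive black-box objective: in practice
# each call might take 10 hours of training or $1M of probe drilling.
def expensive_f(x):
    return -(x - 2.0) ** 2 + 3.0 + 0.1 * np.sin(5.0 * x)

rng = np.random.default_rng(0)

budget = 20                                   # evaluations we can afford
xs = rng.uniform(-5.0, 5.0, size=budget)      # sample candidate points at random
ys = np.array([expensive_f(x) for x in xs])   # pay the full cost for every sample

best = np.argmax(ys)
print(f"best x = {xs[best]:.3f}, f(x) = {ys[best]:.3f}")
```

Every one of the 20 samples here is a full-price evaluation, which is exactly what Bayesian optimization tries to avoid.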
The key here is to define a surrogate model. The surrogate model, we'll call it f̂, doesn't have to be exactly equal to the original function; we want it to be cheap to evaluate, and at the same time it should approximate the function well. In this case we'd be able, for example, to find the maximum of the surrogate model and then sample the original function at that point.

We'll also need an acquisition function; we will call it μ(x). It shows how profitable it is for us to sample the function at the point x: it should estimate the gain for the optimization from sampling at x. And it should use the surrogate model, since the surrogate model is really cheap to evaluate.

So first of all, let's see which surrogate model we will use. The model should be able to approximate arbitrarily complex functions, so the natural choice here is a nonparametric model. Also, it should be able to estimate its uncertainty, so a Gaussian process is a really good choice for this case.

Now let's see what we need from an acquisition function. It should balance between exploration and exploitation. Exploration means searching in regions with high uncertainty: since the uncertainty there is high, there is some chance that you'll find a new maximum there. Exploitation means searching in regions where the value is already estimated to be high: the uncertainty there may be low, but you will still be able to pin down the position of the maximum more precisely.

So, now let's see three different kinds of acquisition functions. The first is the maximum probability of improvement, or MPI for short. We have the current best value; let's write it down as f*. The acquisition function is then the probability that, after sampling at the point x, the value of the function will be greater than or equal to the current best value. So this is the probability that f̂(x), the surrogate model, is greater than or equal to f*, the current best, plus some small parameter ξ. Since everything is normal here, it is really easy to compute this value, and it can be written using the cumulative distribution function of the normal distribution, as shown on the slide.
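Since the surrogate's prediction at any point x is Gaussian with some posterior mean and standard deviation, this probability has a closed form through the normal CDF. A minimal sketch (the function name mpi and the default ξ = 0.01 are illustrative choices, not from the lecture):

```python
from scipy.stats import norm

def mpi(mu, sigma, f_best, xi=0.01):
    """Maximum probability of improvement: P(f_hat(x) >= f* + xi).

    mu, sigma -- GP posterior mean and standard deviation at x
    f_best    -- best value f* observed so far
    xi        -- small margin; larger xi pushes towards exploration
    """
    return norm.cdf((mu - f_best - xi) / sigma)
```

The identity used here is P(f̂(x) ≥ f* + ξ) = Φ((μ(x) − f* − ξ) / σ(x)) for f̂(x) ~ N(μ(x), σ²(x)).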
Another acquisition function is called the upper confidence bound. Here you take the mean value of the surrogate model at the point and add η times its standard deviation at that point. So, by varying the value of η, you can prefer to sample at points where the mean is high or at points where the uncertainty is high.

The third acquisition function that we will see is called expected improvement. As the title suggests, you just compute the expected value of the gain that you'll get after sampling the new point. Again, since everything is normal, you can compute it analytically, and the formula looks like this.

So let's see an example. We start with a few points, in this case three points. And while the process has not converged, you train the Gaussian process, and you find the maximum of the acquisition function, for example using gradient descent or some other optimization technique. Since computing the values of the surrogate model, the Gaussian process, is relatively cheap, this step won't take much time. Then you evaluate the true function at the position of the maximum of the acquisition function. Then again you train the Gaussian process with the new point, find the maximum of the acquisition function again, evaluate at the new point, and so on.

So let's see how it works. Again, here I have three points. Now I add a new one, and below you see the plot of the acquisition function. We find its maximum; it is somewhere on the right. And now you can see that the points on the left have a relatively small value of the acquisition function: the mean there is quite low relative to the current best point, and the variance is also small. It is very unlikely that we will ever get a value greater than the current optimum there, and so we'll sample near the current optimum for the rest of the experiment. As you see, we've changed our estimate of the function a bit, and now we're almost sure that the position of the maximum is somewhere near this point. And so, finally, after sampling a few points, we find the position of the maximum.
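Here is a compact sketch of that whole loop in Python, using scikit-learn's GaussianProcessRegressor as the surrogate and expected improvement as the acquisition. The toy objective, the RBF kernel, the search interval, and the budgets are all made-up illustration values, and maximizing the acquisition over a dense grid stands in for the gradient-based search mentioned above:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_f(x):
    # Hypothetical expensive objective (same toy stand-in as before).
    return -(x - 2.0) ** 2 + 3.0 + 0.1 * np.sin(5.0 * x)

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # Closed form of E[max(f_hat(x) - f* - xi, 0)] for a Gaussian posterior.
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 5.0, 500).reshape(-1, 1)   # candidate points

# Start from a few initial evaluations (the "three points" in the example).
X = rng.uniform(-5.0, 5.0, size=(3, 1))
y = np.array([expensive_f(x[0]) for x in X])

for _ in range(10):                                 # evaluation budget
    # Fit the cheap surrogate to everything observed so far.
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    sigma = np.maximum(sigma, 1e-9)                 # guard against zero std
    # Maximize the acquisition over the grid -- cheap, unlike f itself.
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    # Only now do we pay for one expensive evaluation.
    X = np.vstack([X, [x_next]])
    y = np.append(y, expensive_f(x_next[0]))

i = np.argmax(y)
print(f"best x = {X[i, 0]:.3f}, f(x) = {y[i]:.3f}")
```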
So, it is important to compare random search with the Gaussian process approach. Random search is really easy to parallelize: you just sample some random points and then compute the value of the function at all of them at once. For a Gaussian process, however, you have to come up with some heuristic to propose not one point but many. One such heuristic is to take the maximum of the acquisition function, add a new point to the Gaussian process at this position with its value set to the mean of the Gaussian process there, then compute the maximum of the acquisition function again, take it, and repeat as many times as you want. You'll end up with a set of points that you can then sample in parallel; see the sketch at the end of this section.

On the other hand, random search needs many more points on average to converge, and it also works poorly in high-dimensional spaces. For example, in the plot below you can see that the Gaussian process converges much faster than random search, and it also converges to a better optimum. Random search does work for any function, since sampling a random x is usually quite cheap. However, it is important to know that Bayesian optimization with Gaussian processes is useful only when each evaluation of the function is expensive, since training the Gaussian process also takes some time. If that takes more time than evaluating the true function, then you'd better use random search.
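Here is a sketch of that batch heuristic, in the spirit of what is sometimes called the "kriging believer" strategy. It reuses expected_improvement from the previous sketch, and the name propose_batch and the default batch size are hypothetical choices for illustration:

```python
import numpy as np

def propose_batch(gp, X, y, grid, batch_size=4):
    """Propose `batch_size` points to evaluate in parallel.

    Repeatedly take the acquisition maximum, pretend the observation
    there equals the GP posterior mean (a "fantasy" point), refit,
    and pick again. Reuses expected_improvement from the sketch above.
    """
    X_fake, y_fake = X.copy(), y.copy()
    batch = []
    for _ in range(batch_size):
        gp.fit(X_fake, y_fake)
        mu, sigma = gp.predict(grid, return_std=True)
        sigma = np.maximum(sigma, 1e-9)          # guard against zero std
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y_fake.max()))]
        batch.append(x_next)
        # Fantasy observation: the surrogate's own mean at the new point.
        X_fake = np.vstack([X_fake, [x_next]])
        y_fake = np.append(y_fake, gp.predict(x_next.reshape(1, -1))[0])
    return batch
```

The true function can then be evaluated at all the returned points in parallel, and the real observations replace the fantasy ones before the next round.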