[MUSIC]

Hi, I am Dmitry Vetrov, research professor at the Higher School of Economics, head of the Bayesian Methods Research Group, and scientific advisor of Alex and Daniel. In this lecture, I would like to tell you about one successful example of how deep learning can be combined with Bayesian inference. We will briefly review how Bayesian methods can be scaled to big data.

So, suppose we are given a machine learning problem with data (X, Y), where X contains the observed variables and Y the hidden variables to be predicted. We have a probabilistic classifier that gives us the probabilities of the hidden components given the observed ones, p(y | x, W), parameterized by weights W. Since we are Bayesians, we also establish a reasonable prior, p(W). From the Bayesian point of view, at the training stage we need to compute the posterior distribution, p(W | X, Y). This posterior distribution contains all the information about W that we could extract from our training data, and it is the result of Bayesian training.

At the test stage, we need to perform averaging with respect to this posterior distribution. So we are not applying just a single classifier; we are applying an ensemble, and the weight of each classifier is given by our posterior distribution p(W | X, Y).

This is how it should work in theory, but in practice it does not, and the problem is in these two integrals. They are usually intractable, since they are integrals over huge-dimensional spaces. For example, in the case of deep learning, the dimensionality of W can be tens of millions of parameters. And since the integrals are intractable, we cannot even approximate them roughly. This was the reason why, until very recently, Bayesian methods were considered not scalable.

The situation has changed with the development of so-called stochastic variational inference. Instead of trying to solve the Bayesian inference problem exactly, that is, to find the true posterior distribution p(W | X, Y), we approximate it with a distribution from some parametric family, q(W | phi). The approximation is found by minimizing some kind of distance measure between the two distributions: our variational approximation and the true posterior. The setup is summarized in the formulas below.
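To make this concrete, here is a compact restatement in formulas; x^* and y^* denote a new test object, notation introduced here only for clarity:

p(W \mid X, Y) = \frac{p(Y \mid X, W)\, p(W)}{\int p(Y \mid X, W')\, p(W')\, dW'}

p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW

\phi^* = \arg\min_{\phi}\, \mathrm{KL}\bigl( q(W \mid \phi) \,\|\, p(W \mid X, Y) \bigr)

The first line is the posterior computed at the training stage, the second is the ensemble prediction at the test stage, and the third is the variational approximation that replaces the intractable posterior.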
There can be different distance measures, but one of the most popular is the so-called KL divergence between q and p. As was mentioned in previous lectures, in this case the optimization problem is exactly equivalent to maximizing the so-called ELBO, or evidence lower bound. The ELBO itself is, again, an integral over a huge-dimensional space, and this integral is still intractable. But we do not need to compute the integral exactly; all we need to do is optimize it with respect to the variational parameters. And surprisingly, it appears that this is possible using the stochastic optimization framework.

Our ELBO has several very nice properties. One of them is that the likelihood p(Y | X, W) appears inside a logarithm. This means it can be split into a sum of the individual log-likelihoods of the training objects. So at each iteration of our optimization we do not need to compute the full training log-likelihood; we can compute only its unbiased estimate given by a tiny mini-batch of data. In other words, the ELBO supports mini-batching.

Another good property is that the ELBO is an expectation with respect to our variational approximation. This means that, when we need an unbiased stochastic gradient, we can replace this integral with its unbiased Monte Carlo estimate. For this purpose, we also perform the reparameterization trick in order to reduce the variance of the stochastic gradient. We then simply sample from a distribution that is parameter-free and use the Monte Carlo estimate to compute gradients.

Another good property is that the richer the variational family, the better we approximate the true posterior distribution, so we do not have a risk of overfitting: the more variational parameters we have, the closer we are to the true posterior distribution.

And finally, we can split the ELBO into two parts by splitting the logarithm inside the integral, so that there are two terms. The first term is called the data term, and it is simply the expectation, with respect to our variational approximation, of the training log-likelihood. The second term is the negative KL divergence between our variational approximation and the prior distribution, as written out and sketched in code below.
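Written out, the decomposition just mentioned is

\mathcal{L}(\phi) = \mathbb{E}_{q(W \mid \phi)} \log p(Y \mid X, W) \;-\; \mathrm{KL}\bigl( q(W \mid \phi) \,\|\, p(W) \bigr),

where the first term is the data term and the second is the regularizer.

Here is a minimal sketch of one possible implementation of these ideas, assuming a fully factorized Gaussian q(W | mu, sigma), a standard normal prior, and a Bayesian logistic regression likelihood on synthetic data; it is built with PyTorch, an implementation choice of this illustration rather than something fixed by the lecture.

# A minimal sketch of stochastic variational inference for Bayesian logistic
# regression: fully factorized Gaussian q(W | mu, sigma), standard normal prior.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, batch_size = 10_000, 5, 64            # dataset size, weight dimension, mini-batch size

# Synthetic data standing in for (X, Y); in practice these come from the problem at hand.
X_full = torch.randn(N, D)
Y_full = (X_full @ torch.randn(D) + 0.1 * torch.randn(N) > 0).float()

# Variational parameters phi = (mu, rho); sigma = softplus(rho) keeps the scale positive.
mu = torch.zeros(D, requires_grad=True)
rho = torch.full((D,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

prior = torch.distributions.Normal(torch.zeros(D), torch.ones(D))

for step in range(1000):
    idx = torch.randint(0, N, (batch_size,))     # mini-batch gives an unbiased estimate of the data term
    x, y = X_full[idx], Y_full[idx]

    sigma = F.softplus(rho)
    q = torch.distributions.Normal(mu, sigma)
    w = q.rsample()                              # reparameterization trick: w = mu + sigma * eps

    logits = x @ w
    # Data term: the N / batch_size factor rescales the mini-batch log-likelihood to the full dataset.
    data_term = -(N / batch_size) * F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
    kl = torch.distributions.kl_divergence(q, prior).sum()
    elbo = data_term - kl                        # ELBO = E_q[log p(Y|X,W)] - KL(q || p)

    opt.zero_grad()
    (-elbo).backward()                           # maximize the ELBO by minimizing its negative
    opt.step()

Note how the N / batch_size factor keeps the mini-batch data term an unbiased estimate of the full-data log-likelihood, and how rsample() implements the reparameterization trick so that gradients flow back into mu and rho.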
Note that if we ignore the second term and optimize just the first term with respect to all possible distributions, we will end up with a delta function at the maximum likelihood point, that is, a delta function at W_ML. The second term, the regularizer, prevents us from collapsing to a delta function: it penalizes large deviations from the prior distribution. If we optimize both terms with respect to all possible distributions, we will end up with the true posterior distribution. But since the true posterior is intractable, we restrict the set of possible variational approximations, and then we end up with the variational distribution that is closest, in terms of KL divergence, to the true posterior distribution.

So, to conclude, this is stochastic variational inference. It is a highly scalable technique that provides us with approximate Bayesian inference. The use of stochastic optimization and the reparameterization trick makes SVI applicable to very large datasets. In the next section, I will tell you about dropout and how it can be interpreted from a Bayesian point of view.

[MUSIC]