1 00:00:00,000 --> 00:00:00,643 [MUSIC] 2 00:00:00,643 --> 00:00:06,175 Before we move on to the [INAUDIBLE] allocation, 3 00:00:06,175 --> 00:00:11,444 let's see what the Dirichlet distribution is. 4 00:00:11,444 --> 00:00:13,935 Its probability density function is given as follows. 5 00:00:13,935 --> 00:00:18,834 It is a distribution over the vector theta. 6 00:00:18,834 --> 00:00:21,926 Its components should sum to 1 and be non-negative. 7 00:00:21,926 --> 00:00:24,723 This set is called a simplex. 8 00:00:24,723 --> 00:00:28,759 A really convenient way to picture this is as a triangle. 9 00:00:28,759 --> 00:00:32,711 So I have a triangle with three nodes, which correspond to 10 00:00:32,711 --> 00:00:37,654 the coordinates (1,0,0), (0,0,1), and (0,1,0). 11 00:00:37,654 --> 00:00:44,918 And the vector theta will correspond to the barycentric coordinates of the point. 12 00:00:44,918 --> 00:00:51,888 For example, this red point would have the coordinates 0.3, 0.1, and 0.5. 13 00:00:51,888 --> 00:00:56,944 Since the first and the third coordinates are large, 14 00:00:56,944 --> 00:01:01,767 this point is near the left and the upper nodes. 15 00:01:01,767 --> 00:01:07,052 The distribution is parameterized by the parameter alpha, which is also a vector. 16 00:01:07,052 --> 00:01:11,029 We assume that it's [INAUDIBLE] non-negative. 17 00:01:11,029 --> 00:01:16,182 The probability density function is a normalization constant, one over the 18 00:01:16,182 --> 00:01:21,109 beta function of alpha, times a product over the coordinates of the vector: 19 00:01:21,109 --> 00:01:26,397 the corresponding coordinate theta k to the power alpha k minus 1. 20 00:01:26,397 --> 00:01:32,097 So by varying the parameter alpha, we can get different shapes of the distribution. 21 00:01:32,097 --> 00:01:34,773 For example, if all alphas are less than 1, 22 00:01:34,773 --> 00:01:39,265 we will have a bowl-shaped distribution. 
23 00:01:39,265 --> 00:01:43,467 And the most probable positions would be sparse vectors 24 00:01:43,467 --> 00:01:48,281 that have one coordinate that is large while the others are small. 25 00:01:48,281 --> 00:01:52,997 If, on the other hand, all alphas are greater than 1, 26 00:01:52,997 --> 00:01:56,407 we will have a unimodal distribution. 27 00:01:56,407 --> 00:02:01,390 We can also have different values for the coordinates of alpha. 28 00:02:01,390 --> 00:02:07,238 For example, if alpha is 5, 2, 2, as shown on the left, the distribution would be 29 00:02:07,238 --> 00:02:12,573 concentrated around the first node, which has coordinates 1, 0, 0. 30 00:02:12,573 --> 00:02:16,563 If, for example, we have alpha equal to 5, 5, 2, 31 00:02:16,563 --> 00:02:23,310 then the distribution will be concentrated around the bottom edge of the triangle. 32 00:02:23,310 --> 00:02:27,674 The statistics of this distribution are given as follows. 33 00:02:27,674 --> 00:02:32,607 For example, the mean, or the expected value, of coordinate i, 34 00:02:32,607 --> 00:02:37,630 the expected value of theta i, can be obtained as the ratio of 35 00:02:37,630 --> 00:02:42,230 the corresponding alpha i to the sum of all values of alpha. 36 00:02:42,230 --> 00:02:43,420 This sum is alpha 0. 37 00:02:43,420 --> 00:02:46,884 And the covariance is given as follows. 38 00:02:46,884 --> 00:02:52,195 Also note that when K is equal to 2, we get a beta distribution. 39 00:02:52,195 --> 00:02:58,087 So the beta distribution is a special case of the Dirichlet distribution, 40 00:02:58,087 --> 00:03:01,261 when we have only two dimensions. 41 00:03:01,261 --> 00:03:02,993 All right, as always, 42 00:03:02,993 --> 00:03:08,118 let's see how we can apply this distribution in a real-world example. 43 00:03:08,118 --> 00:03:12,194 Imagine that you develop an online game, and 44 00:03:12,194 --> 00:03:15,951 the players can select the strength, 45 00:03:15,951 --> 00:03:20,472 the stamina, and the speed of their characters. 
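The density and the mean described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names `dirichlet_log_pdf` and `dirichlet_mean` are made up for this example and do not come from the lecture, and the code assumes theta lies strictly inside the simplex (all coordinates positive) so that every logarithm is finite.

```python
import math


def dirichlet_log_pdf(theta, alpha):
    """Log density of Dirichlet(alpha) at theta (hypothetical helper).

    Assumes every theta_k > 0 and sum(theta) == 1, i.e. theta is an
    interior point of the simplex.
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(t <= 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be positive and sum to 1")
    # log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    # log p(theta) = sum_k (alpha_k - 1) * log theta_k - log B(alpha)
    return sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta)) - log_beta


def dirichlet_mean(alpha):
    """E[theta_i] = alpha_i / alpha_0, where alpha_0 is the sum of alpha."""
    alpha0 = sum(alpha)
    return [a / alpha0 for a in alpha]


# With alpha = (5, 2, 2), a point near the node (1, 0, 0) is more likely
# than a point near the node (0, 1, 0).
near_first = dirichlet_log_pdf([0.8, 0.1, 0.1], [5, 2, 2])
near_second = dirichlet_log_pdf([0.1, 0.8, 0.1], [5, 2, 2])
print(near_first > near_second)
print(dirichlet_mean([5, 2, 2]))
```

Working with the log density rather than the density itself is a design choice made here to avoid overflow in the gamma function for large alpha; exponentiate the result if the plain density is needed. With two dimensions the same function evaluates a beta density, which matches the special case mentioned in the lecture.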
46 00:03:20,472 --> 00:03:26,168 So the player has some credit that equals 1, and 47 00:03:26,168 --> 00:03:31,743 he can distribute it among these three criteria. 48 00:03:31,743 --> 00:03:38,557 For example, here, player 1 assigns more credit to 49 00:03:38,557 --> 00:03:44,405 strength, and so his character would be stronger. 50 00:03:44,405 --> 00:03:48,328 However, he compromises on stamina and speed. 51 00:03:48,328 --> 00:03:54,306 The second player, however, assigned equal amounts of credit to 52 00:03:54,306 --> 00:03:59,319 stamina and speed, and he compromised on strength. 53 00:03:59,319 --> 00:04:04,590 The third player assigned equal credit to all three criteria. 54 00:04:04,590 --> 00:04:08,911 And so if we collect the statistics over all our players, 55 00:04:08,911 --> 00:04:11,618 we could have something like this. 56 00:04:11,618 --> 00:04:15,642 It is really intuitive to model this using the Dirichlet distribution. 57 00:04:15,642 --> 00:04:19,830 We could estimate the parameter alpha from our dataset, 58 00:04:19,830 --> 00:04:23,767 from the statistics that we gather from the players. 59 00:04:23,767 --> 00:04:27,216 And we could estimate it to be, for example, 3, 1, 1. 60 00:04:27,216 --> 00:04:30,638 This would mean that in your game, 61 00:04:30,638 --> 00:04:36,658 most of the players prefer stamina over strength and speed. 62 00:04:36,658 --> 00:04:40,528 So actually, there is one more property that I want to tell you about. 63 00:04:40,528 --> 00:04:47,971 The Dirichlet prior is actually conjugate to the multinomial likelihood. 64 00:04:47,971 --> 00:04:52,403 If you don't remember, let me remind you what a conjugate prior is. 65 00:04:52,403 --> 00:04:59,351 So the prior P of theta is conjugate to the likelihood P of X given theta 66 00:04:59,351 --> 00:05:03,901 if the posterior lies in the same family of distributions as the prior. 67 00:05:03,901 --> 00:05:08,391 So here's our multinomial likelihood. 
68 00:05:08,391 --> 00:05:11,805 It is equal to some normalization constant 69 00:05:11,805 --> 00:05:16,700 times the product over all coordinates of theta i to the power x i. 70 00:05:16,700 --> 00:05:19,766 So this is a distribution over counts. 71 00:05:19,766 --> 00:05:23,189 Suppose, for example, you have a die that has six sides. 72 00:05:23,189 --> 00:05:24,509 The K here would be equal to 6. 73 00:05:24,509 --> 00:05:28,588 We will have six possible outcomes. 74 00:05:28,588 --> 00:05:35,985 And x, for example x1, would be equal to the number of times we rolled the number 1. 75 00:05:35,985 --> 00:05:43,971 And also n equals the number of times that we conducted our experiment. 76 00:05:43,971 --> 00:05:49,151 So actually all x k should sum up to n. 77 00:05:49,151 --> 00:05:51,622 And here is our prior, the Dirichlet prior. 78 00:05:51,622 --> 00:05:53,898 We will try to compute the posterior. 79 00:05:53,898 --> 00:05:57,904 We would multiply the likelihood and the prior. 80 00:05:57,904 --> 00:06:03,124 And so the posterior would be proportional to the following function. 81 00:06:03,124 --> 00:06:06,730 The product of theta k to the power alpha k plus x k minus 1. 82 00:06:06,730 --> 00:06:13,087 We obtain this formula by rearranging the terms after multiplication. 83 00:06:13,087 --> 00:06:16,402 And also notice that it is actually a Dirichlet distribution up to a normalization 84 00:06:16,402 --> 00:06:16,912 constant. 85 00:06:16,912 --> 00:06:17,927 All right, so 86 00:06:17,927 --> 00:06:23,264 we can compute the posterior by multiplying the likelihood and the prior. 87 00:06:23,264 --> 00:06:26,066 If we rearrange the terms, we'll get the following formula. 88 00:06:26,066 --> 00:06:29,061 This will be the product over all dimensions. 89 00:06:29,061 --> 00:06:34,464 The probability of the corresponding dimension to the power alpha k plus x k minus 1. 
90 00:06:34,464 --> 00:06:38,162 Now, notice that it is actually a Dirichlet distribution, 91 00:06:38,162 --> 00:06:40,280 up to a normalization constant. 92 00:06:40,280 --> 00:06:45,418 And so the posterior is actually a Dirichlet distribution over theta. 93 00:06:45,418 --> 00:06:49,637 And the vector of parameters would be obtained as alpha k plus x k 94 00:06:49,637 --> 00:06:50,875 at each position. 95 00:06:50,875 --> 00:06:54,785 So we just sum up the two vectors. 96 00:06:54,785 --> 00:07:04,785 [MUSIC]
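The conjugate update from the lecture, where the posterior parameters are alpha k plus x k at each position, can be sketched as follows. This is a minimal illustration only: the function name `dirichlet_posterior`, the flat prior, and the die-roll counts are assumptions invented for the example, not data from the lecture.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts.

    With a Dirichlet(alpha) prior and a multinomial likelihood, the
    posterior is Dirichlet(alpha + counts), added position by position.
    """
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + x for a, x in zip(alpha, counts)]


# Hypothetical six-sided die (K = 6) with a flat prior, alpha_k = 1.
prior = [1.0] * 6
# Made-up counts: x_k is how many times side k came up in n = 10 rolls.
counts = [3, 0, 2, 1, 4, 0]

posterior = dirichlet_posterior(prior, counts)
# Posterior mean of theta_k is (alpha_k + x_k) / (alpha_0 + n).
posterior_mean = [a / sum(posterior) for a in posterior]

print(posterior)
print(posterior_mean)
```

No integration is needed: because the posterior stays in the Dirichlet family, the whole update reduces to adding the count vector to the prior parameter vector.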