Welcome to week two of Practical Bayesian Methods. I'm Alexander Novikov, and this week we're going to cover latent variable models: what latent variables are, why we need them, and how to apply them to real problems. The second topic for this week is the expectation maximization algorithm, which is a key topic of our course; it is a method to train latent variable models, and we will see numerous extensions of it in the following weeks.

So, let's get started with latent variable models. A latent variable is just a random variable which is unobservable to you, neither in the training nor in the test phase. "Latent" is simply "hidden" in Latin. As examples, some phenomena, like height, length, or maybe speed, you can measure directly, and some others you cannot: for example, intelligence or altruism. You can't just measure altruism on some quantitative scale. These variables are usually called latent.

To motivate why we need to introduce this concept into probabilistic modeling, let's consider the following example. Say you have an IT company and you want to hire an employee, so you have a bunch of candidates, and for each candidate you have some data on them. For example, for all of them you have their average high school grades, for some of them you have their university grades, and maybe some of them took IQ tests, and so on. You also conducted a phone screening interview: your HR manager called each of them and asked them a bunch of simple questions to make sure they understand what your company is about.

Now, you want to bring these people onsite for an actual technical interview, but the problem is that you have too many candidates. You can't invite all of them because it's expensive: you have to pay for their flights, their hotels, and so on. So a natural idea arises: let's predict the onsite interview performance for each of them and bring in only those who are predicted to be good enough. So, how do we predict who will be a good fit for our company?
Well, if you have been in the business for a while, you may have some historical data. For a bunch of other people, you know their features, like their grades and their IQ scores, and you know their onsite performance because you have already conducted those interviews. Now you have a standard regression problem: you have a training data set of such records, for new people you want to predict their onsite performance, and you want to bring to onsite interviews only those whose predicted performance is good.

However, there are two main problems that prevent us from applying standard regression methods from machine learning here. First of all, we have missing values. For example, we don't know everyone's university grades, because Jack didn't attend university. That doesn't mean he is not a good fit for your company; maybe he is, and he just never bothered to attend one. So we don't want to ignore Jack; we want to predict some meaningful onsite performance score for him anyway.

The second reason why we don't want to use standard regression methods like linear regression or neural networks is that we may want to quantify the uncertainty in our predictions. Imagine that for some people we predict that their performance is really good, so we certainly want to bring them onsite and maybe even hire them right away. But for others the predicted performance is not good. And for someone the predicted performance can be, for example, 50, which may mean that this person is not a good fit for your company. But it may also mean that we're just not sure about him: we don't know anything about him, we asked the algorithm to predict his performance, and it returned some number, but that number doesn't mean anything.

So in this case, we may want to quantify the uncertainty of the algorithm in its predictions. If the algorithm is quite sure that this person will perform at a level of 50 out of 100, for example, then we may not want to bring him onsite. On the other hand, if some other candidate's predicted performance is also 50 but we're really uncertain about his performance, then we may want to bring him anyway: maybe we just don't know anything about him, and he may be good after all.
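To make the uncertainty idea concrete, here is a minimal sketch of one standard way to get a prediction together with an error bar: Bayesian linear regression with a Gaussian prior on the weights. This is not the method the lecture builds toward, and the features, numbers, and hyperparameters below are all illustrative assumptions.

```python
import numpy as np

# Toy historical data: [high school grade, university grade, IQ], rescaled.
# All values are made up for illustration.
X = np.array([[0.9, 0.8, 0.7],
              [0.4, 0.5, 0.6],
              [0.7, 0.9, 0.8]])
y = np.array([85.0, 45.0, 90.0])   # past onsite performance, 0-100

alpha, beta = 1.0, 0.1             # prior precision, noise precision (assumed)

# Posterior over weights: S = (alpha*I + beta*X^T X)^{-1},  m = beta*S*X^T y
S = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
m = beta * S @ X.T @ y

def predict(x_new):
    """Posterior predictive mean and standard deviation for one candidate."""
    mean = x_new @ m
    var = 1.0 / beta + x_new @ S @ x_new   # noise + parameter uncertainty
    return mean, np.sqrt(var)

# A candidate who looks like the training data vs. an unusual one:
for x in [np.array([0.8, 0.8, 0.8]), np.array([0.0, 0.0, 2.0])]:
    mean, std = predict(x)
    print(f"predicted performance: {mean:.1f} +/- {std:.1f}")
```

The unusual candidate gets a wider predictive standard deviation, which is exactly the distinction between "confidently mediocre" and "we just don't know" that the lecture is after.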
The reason for this uncertainty may be, for example, that the candidate has lots of missing values, or that his data is a little bit contradictory, or that our algorithm just isn't used to seeing people like him.

These two reasons, having missing values and wanting to quantify uncertainty, bring us to the need for probabilistic modeling of the data. As we discussed in week one, one of the usual ways to build a probabilistic model is to start by drawing some random variables and then understanding what the connections between these random variables are: which random variables correlate with each other in some way. In this particular case, it looks like everything is connected to everything. If a person's university grades are high, it directly influences our beliefs about his high school grades or his IQ score, and this is true for any pair of variables here. And the situation where we have all possible edges, where everything is connected to everything, means that we've failed to capture the structure of our probabilistic model. We end up with the most flexible and the least structured model that we can possibly have.

In this situation, to build a probabilistic model of our data, we have to assign a probability to each possible combination of our features. There are exponentially many combinations of different university grades, different IQ scores, and so on, and for each of them we have to assign a probability. This table of probabilities has billions of entries, and it's just impractical to treat these probabilities as parameters.

So we have to do something else, but we can always assume some parametric model, right? We can say that we have these five random variables, and that the probability of any combination of them is some simple function: for example, the exponent of a linear function divided by a normalization constant. In this case, you reduce your model complexity by a lot: now we have just five parameters to train. But the problem here is the normalization constant. To normalize this thing, so that it is a proper probability distribution and sums to one, we have to compute the normalization constant, which is the sum over all possible configurations. And this is a gigantic sum: we have to consider all billions of possible configurations to compute it, which means that training and inference will be impractical.
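In symbols, the parametric model just described, an exponent of a linear function divided by a normalization constant, might be written as follows; the weights w_i and the assumption of 100 values per feature are my notation for illustration, not the slide's:

```latex
% Log-linear model over the five features x_1, ..., x_5:
p(x_1, \dots, x_5) \;=\; \frac{1}{Z}\,\exp\big(w_1 x_1 + \dots + w_5 x_5\big),
\qquad
Z \;=\; \sum_{x'_1, \dots, x'_5} \exp\big(w_1 x'_1 + \dots + w_5 x'_5\big).
```

Only the five parameters w_1, ..., w_5 need to be trained, but if each feature takes 100 values, Z is a sum over 100^5 = 10^10 configurations, which is the gigantic sum the lecture refers to.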
So what else can we do here? Well, it turns out that you can introduce a new variable which you don't actually observe, which we will call intelligence. You can assume that each person has some internal, hidden property, which we call intelligence and, for example, measure on a scale from one to 100. This intelligence directly causes the IQ score, the university grades, and so on. Of course, this connection is non-deterministic: an intelligent person can have a bad day and do poorly on a test. But this is direct causation: intelligence directly causes all these observations.

If we assume such a model, then we reduce the model complexity by a lot: we erased lots of edges, and now our model is much simpler to work with. We can now write down our probabilistic model by using the sum rule of probability: the probability of the observed features is the sum, over all possible values of the intelligence, of the conditional probability of the features given the intelligence, times the prior probability of the intelligence. And this conditional probability factorizes into a product of small probabilities because of the structure of our model. So now, instead of one huge table with all the combinations of five different features, we have just five small tables, each of which assigns probabilities to a pair, like IQ score given intelligence. This means we were able to reduce the model complexity without reducing the flexibility of the model.
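Written out (the notation is mine), the sum rule plus the factorization implied by the model structure give:

```latex
% Marginalizing out the latent intelligence I (on a 1..100 scale):
p(x_1, \dots, x_5)
  \;=\; \sum_{I=1}^{100} p(x_1, \dots, x_5 \mid I)\, p(I)
  \;=\; \sum_{I=1}^{100} p(I) \prod_{i=1}^{5} p(x_i \mid I).
```

If each variable takes 100 values, the five conditional tables p(x_i | I) plus the prior p(I) hold about 5 * 100 * 100 + 100, roughly 50,000 numbers, versus 100^5 = 10^10 entries for the unrestricted joint table.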
So, to summarize: introducing latent variables may simplify our model. It can reduce the number of edges we have and, as a consequence, the number of parameters. And another positive feature of latent variables is that they are sometimes interpretable. For example, take this intelligence variable: for a new person, we can estimate his intelligence on the scale from one to 100, and it can come out as, say, 80. What does that mean? Well, it's not obvious, because you don't know what the scale means, and you're not even sure that this variable measures actual intelligence: you never told your model that this variable should be intelligence, you just said that there should be some variable here. But anyway, this variable can be interpretable, and you can compare the intelligence of different people in your data set according to this scale.

And a downside of latent variable models is that they can be harder to work with: to train a latent variable model, you have to rely on a lot of math. And this math is what this week is all about. So in the next videos, we will discuss methods for training latent variable models.