In this lecture, I'll introduce belief nets. One of the reasons I abandoned back propagation in the 1990s is that it required too many labels. Back then, we just didn't have data sets with sufficient numbers of labels. I was also influenced by the fact that people manage to learn with very few explicit labels. However, I didn't want to abandon the advantages of doing gradient descent learning to learn a whole bunch of weights. So the issue was, was there another objective function that we could do gradient descent on? The obvious place to look was generative models, where the objective function is to model the input data rather than to predict a label. This meshed nicely with a major movement in statistics and artificial intelligence called graphical models. The idea of graphical models was to combine discrete graph structures for representing how variables depended on one another with real-valued computations that inferred the probability of one variable given the observed values of other variables. Boltzmann machines were actually a very early example of a graphical model, but they were undirected graphical models. In 1992, Radford Neal pointed out that using the same kinds of units as we used in Boltzmann machines, we could make directed graphical models, which he called sigmoid belief nets. And the issue then became: how can we learn sigmoid belief nets?

A second problem with back propagation is that for deep networks, the learning time does not scale well. When there were multiple hidden layers, the learning was very slow. You might ask why this was, and we now know that one of the reasons was that we did not initialize the weights in a sensible way. Yet another problem is that back propagation can get stuck in poor local optima. These are often quite good, so back propagation is useful. But we can now show that for deep nets, the local optima you get stuck in if you start with small random weights are typically far from optimal. There is the possibility of retreating to simpler models that allow convex optimization, but I don't think this is a good idea. Mathematicians like to do that because they can prove things, but in practice you're just running away from the complexity of real data. So, one way to overcome the limits of back propagation is by using unsupervised learning.
The idea is that we want to keep the efficiency and simplicity of using a gradient method and stochastic mini-batch descent for adjusting the weights. But we're going to use that method for modeling the structure of the sensory input, not for modeling the relation between input and output. So the idea is, the weights are going to be adjusted to maximize the probability that a generative model would have generated the sensory input. We already saw that in learning Boltzmann machines. One way to think about it is: if you want to do computer vision, you should first learn to do computer graphics. To first order, computer graphics works and computer vision doesn't. The learning objective for a generative model, as we saw with Boltzmann machines, is to maximize the probability of the observed data, not to maximize the probability of labels given inputs.
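In symbols, the objective just described is ordinary maximum likelihood of the observed data under the model. A minimal sketch, with notation ($v$, $h$, $W$, $\mathcal{D}$) that is mine rather than the lecture's:

$$\max_{W} \sum_{v \in \mathcal{D}} \log p(v; W), \qquad p(v; W) = \sum_{h} p(h; W)\, p(v \mid h; W),$$

where $\mathcal{D}$ is the training set of sensory input vectors, $h$ ranges over configurations of the hidden (latent) variables, and $W$ is the set of weights. Note that no label appears anywhere in the objective.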
Then the question arises: what kind of generative model should we learn? We might learn an energy-based model like the Boltzmann machine, or we might learn a causal model made of idealized neurons, and that's what we'll look at first. And finally, we might learn some kind of hybrid of the two, and that's where we'll end up.

So, before I go into causal belief nets made of neurons, I want to give you a little bit of background about artificial intelligence and probability. In the 1970s and early 1980s, people in artificial intelligence were unbelievably anti-probability. When I was a graduate student, if you mentioned probability, it was a sign that you were stupid and that you just hadn't got it. Computers were all about discrete symbol processing, and if you introduced any probabilities, they would just infect everything. It's hard to conceive of how much people were against probability, so here's a quote to help you. I'll read it out: "Many ancient Greeks supported Socrates' opinion that deep, inexplicable thoughts came from the gods. Today's equivalent to those gods is the erratic, even probabilistic neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant." That was in the first edition of Patrick Henry Winston's first AI textbook, and it was the general opinion at the time. Winston was to become the leader of the MIT AI Lab.

Here's an alternative view: "All of this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic. There are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann." That was written by John von Neumann in 1957, and was part of the unfinished manuscript he left behind for what was to be his crowning achievement: his book The Computer and the Brain. I think if von Neumann had lived, the history of artificial intelligence might have been somewhat different.

So, probabilities eventually found their way into AI via something called graphical models, which are a marriage of graph theory and probability theory. In the 1980s, there was a lot of work on expert systems in AI that used bags of rules for tasks such as medical diagnosis or exploring for minerals. Now, these were practical problems, so they had to deal with uncertainty. They couldn't just use toy examples where everything was certain. People in AI disliked probability so much that, even when they were dealing with uncertainty, they didn't want to use probabilities. So they made up their own ways of dealing with uncertainty that did not involve probabilities. You can actually prove that this is a bad bet. Graphical models were introduced by Pearl, Heckerman, Lauritzen, and many others, who showed that probabilities actually worked better than the ad hoc methods developed by people doing expert systems. Discrete graphs were good for representing which variables depended on which other variables. But once you had those graphs, you then needed to do real-valued computations that respected the rules of probability, so that you could compute the expected values of some nodes in the graph given the observed states of other nodes. Belief nets is the name that people in graphical models give to a particular subset of graphs: directed acyclic graphs. And typically, they used sparsely connected ones. And if those graphs are sparsely connected, they have clever inference algorithms that can compute the probabilities of unobserved nodes efficiently.
But these clever algorithms are exponential in the number of nodes that influence each node, so they won't work for densely connected networks.

So, a belief net is a directed acyclic graph composed of stochastic variables, and here's a picture of one. In general, you might observe any of the variables. I'm going to restrict myself to nets in which you only observe the leaf nodes. So we imagine there are these unobserved hidden causes, which may themselves be layered, and they eventually give rise to some observed effects. Once we observe some variables, there are two problems we'd like to solve. The first is what I call the inference problem, and that's to infer the states of the unobserved variables. Of course, we can't infer them with certainty, so what we're after is the probability distributions of the unobserved variables. And if the unobserved variables are not independent of one another given the observed variables, their probability distributions are likely to be big, cumbersome things with an exponential number of terms in them.

The second problem is the learning problem. That is, given a training set composed of observed vectors of states of all of the leaf nodes, how do we adjust the interactions between variables to make the network more likely to generate that training data? Adjusting the interactions would involve both deciding which node is affected by which other node, and deciding on the strength of that effect.

So, let me just say a little bit about the relationship between graphical models and neural networks. The early graphical models used experts to define the graph structure and also the conditional probabilities. They would typically take a medical expert and ask him how likely this was to cause that, and then they would make a graph in which the nodes had meanings. And they typically had conditional probability tables that described how a set of values for the parents of a node would determine the distribution of values for the node. Their graphs were sparsely connected, and the initial problem they focused on was how to do correct inference. Initially, they weren't interested in learning because the knowledge came from the experts.
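To make the idea of a conditional probability table concrete, here is a minimal sketch; the node names and the numbers are invented for illustration, not taken from the lecture:

```python
import random

# Conditional probability table for a binary node "fever" with two binary
# parents, "flu" and "infection". Each entry gives p(fever = 1 | flu, infection);
# the probabilities are made up, standing in for an expert's estimates.
CPT_FEVER = {
    (0, 0): 0.01,
    (0, 1): 0.60,
    (1, 0): 0.70,
    (1, 1): 0.95,
}

def sample_fever(flu: int, infection: int) -> int:
    """Sample the child node given the observed binary states of its parents."""
    return int(random.random() < CPT_FEVER[(flu, infection)])
```

With two binary parents the table has 2^2 rows; with k parents it has 2^k, which is one way to see why the exact inference algorithms mentioned above blow up on densely connected nets.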
By contrast, for neural nets, learning was always a central issue, and hand-wiring the knowledge was regarded as not cool. Although, of course, wiring in some basic properties, as in convolutional nets, was a very sensible thing to do. But basically, the knowledge in the net came from learning the training data, not from experts. Neural networks didn't aim to have interpretability or sparse connectivity to make inference easy. Nevertheless, there are neural network versions of belief nets.

So, if we think about how to make generative models out of idealized neurons, there are basically two types of generative model you can make. There are energy-based models, where you connect binary stochastic neurons using symmetric connections; then you get a Boltzmann machine. A Boltzmann machine, as we've seen, is hard to learn. But if we restrict the connectivity, then it's easy to learn a restricted Boltzmann machine. However, when we do that, we've only learned one hidden layer, and so we're giving up on a lot of the power of neural nets with multiple hidden layers in order to make learning easy.

The other kind of model you can make is a causal model. That is a directed acyclic graph composed of binary stochastic neurons, and when you do that, you get a sigmoid belief net. In 1992, Neal introduced models like this, compared them with Boltzmann machines, and showed that sigmoid belief nets were slightly easier to learn. So, a sigmoid belief net is just a belief net in which all of the variables are binary stochastic neurons. To generate data from this model, you take the neurons in the top layer and determine whether they should be ones or zeros based on their biases; you decide that stochastically. Then, given the states of the neurons in the top layer, you make stochastic decisions about what the neurons in the middle layer should be doing. And then, given their binary states, you make stochastic decisions about what the visible effects should be. By doing that sequence of operations, causally from layer to layer, you get an unbiased sample of the kinds of vectors of visible values that your neural network believes in. So, in a causal model, unlike a Boltzmann machine, it's easy to generate samples.
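Here is a minimal sketch of that generation procedure (ancestral sampling) in NumPy; the layer sizes, weight initialization, and all names are mine, chosen just for illustration. Each unit turns on with probability $p(s_i = 1) = \sigma(b_i + \sum_j s_j w_{ji})$, where the sum runs over its parents in the layer above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(weights, biases, rng):
    """Ancestral sampling from a sigmoid belief net.

    biases[l] holds the biases of layer l (layer 0 is the top layer);
    weights[l] connects layer l to layer l + 1. Every unit is a binary
    stochastic neuron: p(s_i = 1) = sigmoid(b_i + sum_j s_j * w_ji).
    """
    # Top-layer units are switched on or off using only their biases.
    state = (rng.random(len(biases[0])) < sigmoid(biases[0])).astype(float)
    # Each lower layer is then sampled given the binary states of the layer above.
    for W, b in zip(weights, biases[1:]):
        state = (rng.random(len(b)) < sigmoid(state @ W + b)).astype(float)
    return state  # an unbiased sample of a visible vector under the model

# Example: a hypothetical 3-layer net, 4 top units -> 6 hidden -> 8 visible.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 6)), rng.normal(size=(6, 8))]
biases = [np.zeros(4), np.zeros(6), np.zeros(8)]
print(sample_sbn(weights, biases, rng))
```

Note that generation is a single top-down pass, which is exactly the contrast being drawn with Boltzmann machines, where drawing a sample means running a Markov chain until it reaches equilibrium.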