In this lecture, I'll introduce belief nets. One of the reasons I abandoned back propagation in the 1990s was that it required too many labels. Back then, we just didn't have data sets with sufficient numbers of labels. I was also influenced by the fact that people managed to learn with very few explicit labels. However, I didn't want to abandon the advantages of doing gradient descent learning to learn a whole bunch of weights. So the issue was, was there another objective function that we could do gradient descent on? The obvious place to look was generative models, where the objective function is to model the input data rather than to predict a label. This meshed nicely with a major movement in statistics and artificial intelligence called graphical models. The idea of graphical models was to combine discrete graph structures, for representing how variables depended on one another, with real-valued computations that inferred the probability of one variable given the observed values of other variables. Boltzmann machines were actually a very early example of a graphical model, but they were undirected graphical models. In 1992, Radford Neal pointed out that using the same kinds of units as we used in Boltzmann machines, we could make directed graphical models, which he called sigmoid belief nets. And the issue then became, how can we learn sigmoid belief nets?

The second problem with back propagation is that for deep networks, the learning time does not scale well. When there were multiple hidden layers, the learning was very slow. You might ask why this was, and we now know that one of the reasons was that we did not initialize the weights in a sensible way. Yet another problem is that back propagation can get stuck in poor local optima. These are often quite good, so back propagation is useful. But we can now show that for deep nets, the local optima you get stuck in if you start with small random weights are typically far from optimal. There is the possibility of retreating to simpler models that allow convex optimization, but I don't think this is a good idea. Mathematicians like to do that because they can prove things, but in practice you're just running away from the complexity of real data.

So, one way to overcome the limits of back propagation is by using unsupervised learning. The idea is that we want to keep the efficiency and simplicity of using a gradient method and stochastic mini-batch descent for adjusting the weights. But we're going to use that method for modeling the structure of the sensory input, not for modeling the relation between input and output. So the idea is, the weights are going to be adjusted to maximize the probability that a generative model would have generated the sensory input. We already saw that in learning Boltzmann machines. One way to think about it is, if you want to do computer vision, you should first learn to do computer graphics. To first order, computer graphics works and computer vision doesn't. The learning objective for a generative model, as we saw with Boltzmann machines, is to maximize the probability of the observed data, not to maximize the probability of labels given inputs. Then the question arises, what kind of generative model should we learn? We might learn an energy-based model like the Boltzmann machine. Or we might learn a causal model made of idealized neurons, and that's what we'll look at first. Or, finally, we might learn some kind of hybrid of the two, and that's where we'll end up.
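As a concrete illustration of that objective, here is a minimal numpy sketch (my own toy example, not from the lecture) of what "maximize the probability that a generative model would have generated the sensory input" means. To keep it trivial, the toy model assigns an independent probability to each input bit; a real sigmoid belief net would produce p(v) through layers of hidden causes, but the quantity being maximized is the same average log probability of the data.

```python
# A minimal sketch (toy example, not from the lecture).
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 8))   # 100 binary "sensory input" vectors

# Toy generative model: each visible bit is on with an independent probability.
p_model = np.full(8, 0.5)

# Generative objective: the average log probability the model assigns to the data,
#   log p(v) = sum_i [ v_i log p_i + (1 - v_i) log(1 - p_i) ].
# Learning would adjust the model's parameters to make this as large as possible,
# i.e. to make the model likely to have generated the inputs.
log_p_data = np.mean(np.sum(data * np.log(p_model) +
                            (1 - data) * np.log(1 - p_model), axis=1))
print("average log p(v) under the model:", log_p_data)
```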
So, before I go into causal belief nets made of neurons, I want to give you a little bit of background about artificial intelligence and probability. In the 1970s and early 1980s, people in artificial intelligence were unbelievably anti-probability. When I was a graduate student, if you mentioned probability, it was a sign that you were stupid and that you just hadn't got it. Computers were all about discrete symbolic processing, and if you introduced any probabilities, they would just infect everything. It's hard to conceive of how much people were against probability, so here's a quote to help you. I'll read it out. "Many ancient Greeks supported Socrates' opinion that deep, inexplicable thoughts came from the gods. Today's equivalent to those gods is the erratic, even probabilistic neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant." That was in the first edition of Patrick Henry Winston's AI textbook, and it was the general opinion at the time. Winston was to become the leader of the MIT AI Lab.

Here's an alternative view. "All of this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic. There are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann." That was written by John von Neumann in 1957, and was part of the unfinished manuscript he left behind for what was to be his crowning achievement, his book The Computer and the Brain. I think if von Neumann had lived, the history of artificial intelligence might have been somewhat different.

So, probabilities eventually found their way into AI through something called graphical models, which are a marriage of graph theory and probability theory. In the 1980s, there was a lot of work on expert systems in AI that used bags of rules for tasks such as medical diagnosis or exploring for minerals. Now, these were practical problems, so they had to deal with uncertainty. They couldn't just use toy examples where everything was certain. People in AI disliked probability so much that even when they were dealing with uncertainty, they didn't want to use probabilities. So they made up their own ways of dealing with uncertainty that did not involve probabilities. You can actually prove that this is a bad bet. Graphical models were introduced by Pearl, Heckerman, Lauritzen, and many others, who showed that probabilities actually worked better than the ad hoc methods developed by people doing expert systems. Discrete graphs were good for representing which variables depended on which other variables. But once you had those graphs, you then needed to do real-valued computations that respected the rules of probability, so that you could compute the expected values of some nodes in the graph given the observed states of other nodes. Belief nets is the name that people in graphical models give to a particular subset of these graphs: directed acyclic graphs. And typically, they use sparsely connected ones. If those graphs are sparsely connected, there are clever inference algorithms that can compute the probabilities of unobserved nodes efficiently. But these clever algorithms are exponential in the number of nodes that influence each node, so they won't work for densely connected nets.
So, a belief net is a directed acyclic graph composed of stochastic variables, and here's a picture of one. In general, you might observe any of the variables, but I'm going to restrict myself to nets in which you only observe the leaf nodes. So what we imagine is that there are unobserved hidden causes, which may themselves be layered, and they eventually give rise to some observed effects. Once we observe some variables, there are two problems we'd like to solve. The first is what I call the inference problem, and that's to infer the states of the unobserved variables. Of course, we can't infer them with certainty, so what we're after is the probability distributions over the unobserved variables. And if the unobserved variables are not independent of one another given the observed variables, those probability distributions are likely to be big, cumbersome things with an exponential number of terms in them. The second problem is the learning problem. That is, given a training set composed of observed vectors of states of all of the leaf nodes, how do we adjust the interactions between variables to make the network more likely to generate that training data? Adjusting the interactions involves both deciding which node is affected by which other nodes, and also deciding on the strength of that effect.

So, let me just say a little bit about the relationship between graphical models and neural networks. The early graphical models used experts to define the graph structure and also the conditional probabilities. They would typically take a medical expert and ask how likely this was to cause that, and then they would make a graph in which the nodes had meanings. And they typically had conditional probability tables that described how a set of values for the parents of a node would determine the distribution of values for the node. Their graphs were sparsely connected, and the initial problem they focused on was how to do correct inference. Initially, they weren't interested in learning, because the knowledge came from the experts. By contrast, for neural nets, learning was always a central issue, and hand-wiring the knowledge was regarded as not cool. Although, of course, wiring in some basic properties, as in convolutional nets, was a very sensible thing to do. But basically, the knowledge in the net came from learning the training data, not from experts. Neural networks didn't aim to have interpretability or sparse connectivity to make the inference easy. Nevertheless, there are neural network versions of belief nets.

So, if we think about how to make generative models out of idealized neurons, there are basically two types of generative model you can make. The first is energy-based models, where you connect binary stochastic neurons using symmetric connections; then you get a Boltzmann machine. A Boltzmann machine, as we've seen, is hard to learn. But if we restrict the connectivity, then it's easy to learn a restricted Boltzmann machine. However, when we do that, we've only learned one hidden layer, and so we're giving up on a lot of the power of neural nets with multiple hidden layers in order to make learning easy. The other kind of model you can make is a causal model, that is, a directed acyclic graph composed of binary stochastic neurons. And when you do that, you get a sigmoid belief net. In 1992, Neal introduced models like this, compared them with Boltzmann machines, and showed that sigmoid belief nets were slightly easier to learn.
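To make the inference problem concrete, here is a brute-force sketch (my own toy example with made-up weights, not from the lecture) for a tiny sigmoid belief net with two hidden causes and one visible effect. It computes the exact posterior over the hidden causes, given that the visible unit is on, simply by enumerating every hidden configuration.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny sigmoid belief net: two hidden causes -> one visible effect.
bias_h = np.array([-1.0, -1.0])   # biases of the hidden causes (their priors)
w = np.array([2.0, 2.0])          # weights from the hidden causes to the visible unit
bias_v = -2.0                     # bias of the visible unit

# Inference by brute force: enumerate every configuration of the hidden variables.
# p(h, v=1) = p(h1) * p(h2) * p(v=1 | h1, h2); the posterior is joint / sum of joints.
joint = {}
for h in product([0, 1], repeat=2):
    h = np.array(h)
    prior = np.prod(np.where(h == 1, sigmoid(bias_h), 1.0 - sigmoid(bias_h)))
    likelihood = sigmoid(bias_v + w @ h)          # p(v=1 | h)
    joint[tuple(h)] = prior * likelihood

z = sum(joint.values())
posterior = {h: p / z for h, p in joint.items()}
print(posterior)   # the hidden units are not independent of each other given v=1
```

With N hidden units this enumeration has 2^N terms, which is why the posterior can be a big, cumbersome object and why exact inference becomes hopeless in densely connected nets.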
So, a sigmoid belief net is just a belief net in which all of the variables are binary stochastic neurons. To generate data from this model, you take the neurons in the top layer and you determine whether they should be ones or zeros based on their biases, and you determine that stochastically. Then, given the states of the neurons in the top layer, you make stochastic decisions about what the neurons in the middle layer should be doing. And then, given their binary states, you make decisions about what the visible effects should be. By doing that sequence of operations, a causal sequence from layer to layer, you get an unbiased sample of the kinds of vectors of visible values that your neural network believes in. So, in a causal model, unlike a Boltzmann machine, it's easy to generate samples.
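Here is a minimal sketch of that top-down, layer-by-layer generation procedure, often called ancestral sampling. The layer sizes, weights, and function names are made up for illustration; the lecture itself gives no code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sigmoid_belief_net(weights, biases, rng):
    """Ancestral sampling: work downward from the top layer to the visible layer.

    weights[l] connects layer l+1 (above) to layer l (below); biases[l] are the
    biases of layer l. Layer 0 is the visible layer. Shapes are made up for this sketch."""
    # Top layer: each unit is turned on stochastically according to its bias alone.
    state = (rng.random(biases[-1].shape) < sigmoid(biases[-1])).astype(float)
    # Each lower layer: turn units on stochastically given the layer above.
    for l in reversed(range(len(weights))):
        p = sigmoid(biases[l] + weights[l] @ state)   # p(unit on | layer above)
        state = (rng.random(p.shape) < p).astype(float)
    return state                                      # an unbiased sample of the visibles

rng = np.random.default_rng(0)
# Hypothetical 3-layer net: 4 top units -> 6 middle units -> 8 visible units.
weights = [rng.normal(size=(8, 6)), rng.normal(size=(6, 4))]
biases = [np.zeros(8), np.zeros(6), np.zeros(4)]
print(sample_sigmoid_belief_net(weights, biases, rng))
```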