So now, we'll go beyond least squares. However, let me mention that the main technique we cover this week is least squares, which has already been covered. What we will follow is essentially a broad overview of a variety of different areas dealing with different types of prediction models and their relationships, which are increasingly becoming apparent in the research community.

Let's first see what happens if we have categorical data. That is, the target variables that we are trying to predict, the y's, are not numbers but a yes or a no, which is essentially the learning problem that we had earlier. Well, the challenge here is that the x's are still numbers. So, it might not be such a great idea to use something like a naive Bayes classifier, because one would have to discretize those numbers into categorical variables, which might not be appropriate.

So, one tends to try to use regression techniques. But using a linear regression when all one is trying to do is separate out the no's, which are the reds, from the blues, which may be yes's, is quite brittle, because the points right near the line might get classified either way, and the distinctions one has to make are very small and subject to a lot of error. The other problem, of course, is that there may be no such line separating the data, and we'll come to that in a minute.

But just to make regression work for categorical data, the most common technique being used today is logistic regression, which essentially replaces f transpose x by a function of the form F(x) = 1 - 1/(1 + e^(-f^T x)).

Let's see how this function behaves. Suppose f transpose x is a very large positive value. Then e to the minus f transpose x is close to zero, so the second term is one over roughly one, and one minus that is almost zero: F(x) is close to zero. On the other hand, as f transpose x moves away from zero towards negative values, e to the minus f transpose x rapidly increases towards infinity, making the second term almost zero. So, F(x) goes close to one.
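To see this behaviour numerically, here is a minimal sketch (my own illustration, not code from the lecture) of the function just described; the weights and inputs are made up purely to show what happens at the extremes.

```python
import numpy as np

def logistic_output(f, x):
    """The function just described: F(x) = 1 - 1 / (1 + exp(-f.x))."""
    z = np.dot(f, x)                      # f transpose x
    return 1.0 - 1.0 / (1.0 + np.exp(-z))

# Made-up weights and inputs, purely to show the behaviour at the extremes.
f = np.array([2.0, -1.0])
for x in [np.array([ 5.0, 0.0]),   # f.x = 10 (large, positive)  -> output near 0
          np.array([ 0.0, 0.0]),   # f.x = 0                     -> output 0.5
          np.array([-5.0, 0.0])]:  # f.x = -10 (large, negative) -> output near 1
    print(f"f.x = {np.dot(f, x):6.1f}   F(x) = {logistic_output(f, x):.5f}")
```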
So, the speed at which that second term deviates from one, as f transpose x moves away from zero, is what makes logistic regression particularly attractive for separating out the yes's from the no's. Of course, trying to fit such a function is not as easy as linear least squares, but I'll show how that can be done very soon.

The other problem with linear regression is that the data might not be separable using a line. So, for example, you might have a parabolic relationship, as we described earlier. To support more complex functions, and when one doesn't even know what kind of function to use, the technique called support vector machines has become very popular. There, the kind of function f is parameterized by kernel parameters, and these parameters are also learned along with the function itself. So, not only do we learn the function, but we also learn the type of function from the data. And so, if it's a parabola, you learn a parabola. If it's something more complicated, like a circle with a hole inside, rather like a doughnut, the support vector machine will learn such functions as well. The point I'm trying to make is that, starting from linear least squares, one increases the complexity of the functions one is trying to learn, and we arrive at other types of regression and at support vector machines.

Way back in the early days of AI, one found many efforts to mimic the brain, which essentially resulted in the field of neural networks, where one tried to create structures that look something like the way neurons in the brain affect each other, based on their connections to other neurons. The first neural networks, created in the 50s by McCulloch, Pitts, Rosenblatt and others, essentially looked like linear combinations of inputs. So, they were pretty much like least squares, in the sense that you would have a neural network which would say that these are neurons, and the input to this neuron is the rainfall, this one's the temperature, this one's the harvest rainfall, and then we have another one whose input is just the constant one. The output would be the wine quality, and the weights, that is, the connections between these neurons, would be learned by a process of iterative refinement, which could essentially be looked upon as a way of solving the least squares problem. To deal with exactly the same problems of classification, we found logistic functions being included in neural networks.
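To make that early picture concrete, here is a minimal sketch in Python, again my own illustration rather than code from the lecture, of such a single-layer network: one output neuron connected to rainfall, temperature and harvest-rainfall inputs plus the constant-one input, with its weights learned by iterative refinement, that is, plain gradient descent on the squared error. The data and the "true" weights are made up, and the last line simply checks the result against a direct least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical, standardised training data (mean 0, variance 1) for the three
# inputs mentioned in the lecture, plus the extra input that is just the constant one.
rain, temp, harvest = rng.standard_normal((3, n))
X = np.column_stack([rain, temp, harvest, np.ones(n)])

true_w = np.array([0.1, 0.6, -0.4, 3.0])          # made-up "true" weights
y = X @ true_w + 0.05 * rng.standard_normal(n)    # wine quality, with some noise

# Iterative refinement of the connection weights: plain gradient descent on the
# squared error, which amounts to solving the same least-squares problem.
w = np.zeros(4)
lr = 0.1
for step in range(1000):
    err = X @ w - y                  # prediction error on all examples
    w -= lr * (X.T @ err) / n        # move the weights against the error gradient

print("iteratively refined weights:", np.round(w, 2))
print("direct least-squares solve :", np.round(np.linalg.lstsq(X, y, rcond=None)[0], 2))
```

Passing the same neuron's output through the logistic function described earlier is what turns this regressor into a classifier.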
So essentially, this field was evolving in parallel with statistical prediction techniques. Interest in neural networks sort of waned over the years. Along the way, of course, we figured out different types of neural networks which were multi-layer, and essentially, by combining logistic functions, linear functions, and other types of functions, one could learn more complex non-linear functions of the input variables. Hidden layers were introduced, which essentially create more complicated forms of f, parameterized by the weights, which would then get learned through a process of optimization, as we'll see shortly.

Finally, in recent years, in spite of the fact that interest in neural networks had waned over the years, these have become interesting once again with the notion of feedback. Notice that in the networks shown here, all the links go forward from the input nodes to the output nodes. So, they are feed-forward, even though they're multi-layer. But when we start having feedback, that is, when the output of a node in the middle is fed back into previous nodes, this starts behaving like a Bayesian belief network. In particular, feedback neural networks display the same kind of explaining-away effect which we found in Bayesian networks: once you figure out that a particular cause is more likely, other causes which might also contribute to the same observation become less likely. So, that makes these much more interesting than the older neural networks and has sparked much recent interest in this field. The fields of deep belief networks, multi-layer feedback neural networks, and temporal neural networks are all coming together, and deep relationships between them are being explored. We'll come to these points soon.

But for the moment, let's try to understand: if one has such a complex f, how would one learn the parameters of f?
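Before turning to that question, here is a small sketch, again my own and with made-up weights, of what such a complex f looks like for a feed-forward network with one hidden layer of logistic units and a linear output. The weight matrices W_hidden and w_out are exactly the parameters we now have to ask how to learn.

```python
import numpy as np

def logistic(z):
    # The same logistic squashing function used earlier.
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W_hidden, w_out):
    """One hidden layer of logistic units followed by a linear output unit.

    The whole thing is just a more complicated form of f, parameterised by
    the entries of W_hidden and w_out.
    """
    h = logistic(W_hidden @ x)       # hidden-layer activations
    return w_out @ h                 # linear combination of the hidden outputs

# Made-up weights and input, purely to show the shape of the computation.
x = np.array([0.4, -1.2, 0.7])           # three input nodes
W_hidden = np.array([[ 0.5, -0.3, 0.8],  # weights into two hidden neurons
                     [-0.7,  0.2, 0.1]])
w_out = np.array([1.5, -2.0])            # weights into the single output node

print(feed_forward(x, W_hidden, w_out))
```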