Whatever our model, one needs to minimize the error between f of x and y, that is, the error between the real values you already have and the values the model predicts for them. Well, f is complicated; there is no closed-form formula like the normal equations we had for linear least squares. So all of these methods, whether they are regression with a logistic function, support vector machines, or neural networks, learn their parameters through an iterative process.

We start with some f, which is really a parameterization of the function f. It may not be linear like f transposed x; it could be some other parameterization: you could have squares, you could have exponentials, all kinds of things. But you have some parameters which tell you, for each choice of parameters, what the function f is. And we need to refine it iteratively to find the f which minimizes the error. In the world of neural networks, the first such search algorithm was called back-propagation, which has nothing to do with feedback in the network; it is just an algorithm for learning the parameters. In other areas, which were perhaps more mathematically grounded, straightforward optimization techniques were used to find the minimum of this error.

So we start off with a guess, and we choose a slightly different f as a second guess. One way to start reducing this error, or moving towards a value of f which is likely to have lower error, is to use something called gradient descent. The formula looks a bit daunting, so let me explain it in simple language. Epsilon of f is a function of the parameters of f. So one can measure how fast this function changes when one changes each component of f separately, which is nothing but taking the partial derivative of the function with respect to that component. String all those partial derivatives together and you get a vector, which is just this del sub f of epsilon, the gradient. That vector tells you in which direction to move each component of f so as to reduce epsilon. So, for example, if the first component of f is such that increasing it decreases epsilon, then you would choose to increase that component of f. If the second component is one where epsilon decreases only when you decrease that component, then that particular element of del would be negative.

So you get this vector del, computed at your current value of f, that is, at your previous iterate, starting with f0, for example, and then you add some small multiple of it to your current guess to create a new guess. Now, you still need to compute these derivatives, so you use the difference between epsilon at f i and epsilon at f i minus one, divided component-wise by f i minus f i minus one, to get estimates for the derivatives. That is why you needed two guesses to start with: you have to bootstrap the iteration. This is a very popular and very simple method. There are more sophisticated ways of finding the minimum of epsilon, such as Newton's iteration, where you use second derivatives in addition to first derivatives, and various other techniques from the mathematics literature are available as well.
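To make the shape of that iteration concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from the lecture: the helper names, the step size, and the toy error function are all invented. The derivatives are estimated numerically by nudging one component at a time (a per-component variant of the two-guess estimate described above), and a small multiple of the descent direction is added to the current guess at every step.

```python
import numpy as np

def estimate_gradient(epsilon, f, h=1e-6):
    """Approximate each partial derivative of the error epsilon at f
    with a small one-sided finite difference."""
    grad = np.zeros_like(f)
    for j in range(len(f)):
        f_step = f.copy()
        f_step[j] += h                                # change only component j
        grad[j] = (epsilon(f_step) - epsilon(f)) / h
    return grad

def gradient_descent(epsilon, f0, step=0.1, iters=500):
    """Repeatedly move the parameter vector a small amount downhill."""
    f = np.array(f0, dtype=float)
    for _ in range(iters):
        f = f - step * estimate_gradient(epsilon, f)  # new guess from current guess
    return f

# Toy error: epsilon(f) = (f1 - 3)^2 + (f2 + 1)^2, minimized at (3, -1)
epsilon = lambda f: (f[0] - 3.0) ** 2 + (f[1] + 1.0) ** 2
print(gradient_descent(epsilon, f0=[0.0, 0.0]))       # close to [3, -1]
```

In practice you would rarely hand-roll this; the point is only to show the loop: estimate the derivatives, step against them, repeat.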
This works fine if you have numbers, that is, if the x's are actual numerical values as we have been assuming, but there are, of course, some caveats. The iteration really works in all situations only if there is a single minimum, or if you happen to start very close to the right one. If there are many different minima, you might get stuck in a local minimum and never reach the true minimum value, so you need better techniques.

The other complication is that the function f might not be parameterized by a single unconstrained set of values; there may be constraints. You might have a function made up of five different lines which have to meet at their connecting points, and you are trying to fit those five lines to all your data. Then you have constraints, which makes the situation more complicated. But we won't go into those details right now.

The other problem is if x is not numeric. If x is categorical, or at least some of the x's are categories like yes/no or red/blue/green, then you somehow have to convert them to a binary form. So, if there are n categorical variables, you might convert each category to a zero-or-one indicator. Of course, now you are going to have many more than n variables. Take words, for example: you have categorical variables which are words, and you suddenly blow up the dimension of the space. You are talking about one word, but that word could come from a vocabulary of millions, so now you have a million-dimensional space with a zero or one in each position depending on whether that word is present. That tremendously complicates such methods, and they often become unviable when your categorical data is very high-dimensional; you have to use different techniques to learn the parameters.
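Here is what that zero-or-one conversion (commonly called one-hot encoding) looks like concretely. The colour list, the word vocabulary, and the helper name are all just an illustration, not anything from the lecture:

```python
def one_hot(value, vocabulary):
    """Turn a single categorical value into a 0/1 indicator vector,
    one position per possible category."""
    return [1 if value == v else 0 for v in vocabulary]

colours = ["red", "blue", "green"]
print(one_hot("blue", colours))    # [0, 1, 0]

# With words the same idea explodes in dimension: one position per word
# in the whole vocabulary, almost all of them zero for any given record.
vocabulary = ["the", "cat", "sat", "mat"]   # imagine millions of entries instead
print(one_hot("cat", vocabulary))  # [0, 1, 0, 0]
```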
You could also convert these categorical variables to real numbers through a process of reverse classification, which we won't talk about here, but essentially you treat categories like high, low, and medium not as fixed labels but as degrees of membership, so something can be 50 percent high, 20 percent low, and so on. You then use such fuzzy memberships to convert your categorical variables into real numbers, and once you have real numbers, you can use the technique above.

Or you can work in the space of categorical data itself. There are techniques like neighborhood search, heuristic AI search, genetic algorithms, and so on, which do not rely on converting your parameters into a real-valued space. But then you run into the problem of combinatorial explosion, and you have to use all kinds of techniques to find good ways of minimizing the error.

And lastly, as we have seen in previous lectures, probabilistic models can deal with the probabilities of variables taking on particular categorical values directly, as opposed to having to convert them to numbers first. These techniques have become increasingly powerful and popular precisely because they let you work with categorical data without running into too much exponential combinatorial blowup.

Now, that is quite a mouthful, and many of you might not have got all of it. But those of you who are on the verge of graduate research might want to listen to that spiel a couple of times, because it does give a bigger picture of how learning parameters in predictive models works and what the possible options are.
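Purely as an illustration of the fuzzy-membership idea mentioned above, and not the procedure the lecture has in mind (the labels, numbers, and encoding here are all invented): a categorical reading can be replaced by a small vector of membership degrees, which is already a real-valued representation that the earlier machinery can work with.

```python
# Hypothetical fuzzy memberships for a temperature category: each label is
# mapped to degrees of being low / medium / high rather than a hard choice.
memberships = {
    "cool": {"low": 0.7, "medium": 0.3, "high": 0.0},
    "warm": {"low": 0.1, "medium": 0.6, "high": 0.3},
    "hot":  {"low": 0.0, "medium": 0.2, "high": 0.8},
}

def to_real_features(label):
    """Replace a categorical label with its real-valued membership degrees."""
    m = memberships[label]
    return [m["low"], m["medium"], m["high"]]

print(to_real_features("warm"))   # [0.1, 0.6, 0.3]
```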
There are some more linkages which I'd like to explore. For example, here we are trying to minimize the error in a predictive model. In other situations, we might need to maximize an objective instead: we may want to find the best way to transport goods across the country, or the best way to allocate one's advertising funds across different channels. There you are trying to maximize some function, and that is closely related to the problem of minimizing a prediction error. Similar techniques apply, gradient descent, Newton's method, and various others, and all the same issues of categorical data, local minima, and constraints apply as well.

Then, once one has decided where one wants to go, one needs to figure out how to get there. Typically, if you are driving a car, for example, you have a system, and your steering angle could be, say, theta. Every time you change your theta, the car moves in a certain direction, so you have a new state of the system. The goal is to figure out which sequence of control actions will ultimately minimize the difference between the state of your system and where you want that state to be. There could be other constraints too, for instance that you want the path to be fairly smooth, so you get all those problems of constraints again. Again, it is a minimization problem. But here, unlike in the first two cases of minimizing a prediction error or optimizing a function, each iteration is actually executed, and what you are dealing with is the response of a real system rather than an abstract function. An actual control system makes a change, sees what the output is, and makes another change; once more, techniques like gradient descent, Newton's method, and so on start applying, but now you are controlling actions.

So, from predicting the future, or predicting where things are going to go, to figuring out how best to use that information to optimize one's own goals, and then deciding how to change one's behavior or one's system so as to get to the goal state, all of these are tightly related. And I hope you have seen that the mathematics is related. Now let's talk about some of the applications.
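Before the applications, here is a toy sketch of that execute-and-observe loop. The plant model, the gains, and the numbers are all invented for illustration, and this is a simple proportional-style correction rather than anything the lecture prescribes: on each iteration a control action (the steering angle) is chosen from the currently observed error, the action is applied to the system, and the new state is observed.

```python
def simulate_step(heading, theta, gain=0.5):
    """Hypothetical system response: one step changes the heading by gain * theta."""
    return heading + gain * theta

target = 90.0     # where we want the state (heading) to end up
heading = 0.0     # observed state of the system
K = 0.5           # how strongly the observed error drives the next correction

for _ in range(50):
    error = heading - target
    theta = -K * error                        # pick the next action from the observed error
    heading = simulate_step(heading, theta)   # execute it and observe the system's response

print(round(heading, 3))                      # settles close to 90.0
```

Real controllers, or a gradient-descent formulation over a whole sequence of actions, are of course more elaborate; the point is only that each step acts on the actual system and then corrects.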
Before we do that, of course: in practice, whether you are dealing with support vector machines, logistic regression, neural networks, or what have you, you won't ever really have to implement these things yourself. There are packages available that let you apply any of these techniques without having to get into the math or the details of the iteration. The important thing, of course, is choosing which package and which technique to use. As we go through the applications, I will try to summarize, from my experience, which packages and what kinds of techniques to use. And then we will move on to further topics, such as deep belief networks.