Let's now revisit machine learning, which we studied during the Listen chapter; the naive Bayes classifier is one method for machine learning. Let's try to do this a little more formally, in a manner that lets some of the material that comes later fall naturally into the same formal framework.

We have a bunch of data, in this case features X1 to XN, which we call X: capital X is a vector, a collection of features. Earlier we had different examples: the query terms that we used, the words in a comment, and in general anything that we observe about instances which we then want to classify into different classes. Then we have some output variables, which are the labels that we would like to assign to a class, whether a positive or negative sentiment, or a buyer or a browser. In this case we'll say the output variable is a zero if it's a buyer and a one if it's a browser, or a zero if it's negative and a one if it's positive, and so on. In general we might have many different output variables, each of them zero or one.

The points in the space X could be shown as points in some high-dimensional space. I've shown them here as just points in the plane, but in general each of these will have many coordinates, N for example. Some of them are red, which means they have a zero assigned to them, and some of them are blue, which means they have a one assigned to them, in terms of the Y variable. The goal of classification is to figure out, given a new point, whether it's likely to be red or blue. In general this need not be a binary task, and you may want to classify into three, four, or many classes, but for the moment we stick to binary classification, as it makes things a bit simpler.

Now, to formally model the classification problem, given a bunch of features in a space, we define a function, let's say f(X), which is the expected value of the output variable Y given the input variables X.
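In symbols, this definition (a notational restatement of what was just said, nothing more) is:

```latex
f(x) \;=\; \mathbb{E}\left[\, Y \mid X = x \,\right]
      \;=\; \sum_{y \in \{0,1\}} y \, P(Y = y \mid X = x)
```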
Now, this is a little different from what we had earlier, which was the probability of getting some variable Y equal to zero or one given a particular combination of X's. But as we shall soon see, it is really the same thing, at least for classification; for other types of machine learning we'll have different forms of f, and that's why this framework is useful: it unifies our whole concept of machine learning.

So what we're trying to do is figure out the expected value of Y. Is it a zero or a one? Is it closer to zero or closer to one, given a particular combination x? Well, since Y is zero-one, the expected value is nothing but one times the probability of Y equal to one given X, plus zero times the probability of Y equal to zero given X. Since the second term is multiplied by zero, it disappears, and we simply get the probability that Y equals one given X. And this is exactly what we had earlier estimated using various assumptions, like independence and the Bayes rule. We figured out that all we needed was a training set from which we could compute the required likelihoods, which would enable us to compute the probability of Y equal to one given a combination x.

Now, that's fairly basic, but this formulation tells us a few things. First, it's not always necessary to compute this probability of Y equal to one given x directly; we did that through some approximation in naive Bayes. We might instead want to compute this function f(X). In this case it turns out to be just the same as this probability; in other cases it might not be. For example, if you had a classifier with more than two classes, you might have a different form of f. The second thing it tells you is that the probability is only one way of figuring out where a particular combination X belongs. Another way might be to somehow estimate this f and estimate the boundary where f(X) is more than or less than a half. For example, in this case, if you could figure out that the boundary lay somewhere along this line over here, so that we minimize the number of false positives and false negatives, we could find a much easier way to classify the points as being positive or negative than actually computing this rather complex animal, which is the a posteriori probability.
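To make this concrete, here is a minimal sketch of the first route: estimating P(Y=1|X) from a training set under the naive Bayes independence assumption, and then classifying by comparing that estimate with one half. This is not the lecture's worked example; the tiny training matrix, the Laplace smoothing, and all names are made up for illustration.

```python
import numpy as np

# Toy training data: each row is a binary feature vector x (e.g., presence or
# absence of particular query words), and y is the class label
# (e.g., 1 = buyer, 0 = browser).  Purely illustrative numbers.
X_train = np.array([[1, 0, 1, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 1, 0, 1],
                    [1, 1, 1, 0]])
y_train = np.array([1, 1, 0, 0, 1])

def naive_bayes_posterior(x, X_train, y_train, alpha=1.0):
    """Estimate P(Y=1 | X=x) using the independence assumption and
    add-one (Laplace) smoothing on the per-feature likelihoods."""
    joint = []
    for c in (0, 1):
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        # Smoothed estimate of P(x_j = 1 | Y = c) for each feature j.
        p1 = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        likelihood = np.prod(np.where(x == 1, p1, 1 - p1))
        joint.append(prior * likelihood)
    # Bayes rule: normalise by the evidence P(X = x).
    return joint[1] / (joint[0] + joint[1])

x_new = np.array([1, 0, 1, 0])
f_hat = naive_bayes_posterior(x_new, X_train, y_train)  # estimate of f(x) = P(Y=1|x)
label = 1 if f_hat >= 0.5 else 0                        # the "more or less than a half" test
print(f_hat, label)
```

The final comparison with 0.5 is exactly the boundary test on f(X) described above.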
Directly trying to find estimates of f is what more sophisticated machine learning techniques, like support vector machines, end up doing. They also do other complicated things, like changing the nature of the space: instead of dealing with the space as it is given, they use various combinations of the x's, sometimes squaring them, sometimes doing other funny things to them, to project the space into a different space, and then find a line or a plane which nicely separates the positive and negative instances (a tiny numerical sketch of this squaring idea appears a little further below). So what they're really doing is estimating the boundary of f directly, without actually computing the function itself. Now, that's another long-winded way of defining machine learning, but it will serve to unify the different types of learning that we'll see as we go along this week.

Let's take a look at some examples using the new notation that we just introduced. Recall our machine-learning task of deciding whether somebody is a buyer or a browser, based on the queries that they issue in a search box. In this case the queries that we had last time were red, flower, gift, or cheap, and we had various instances of people querying and then possibly buying or not buying. So our X was essentially the set of query words, and Y was whether or not one buys; our (Y, X) space is B, R, F, G and C in the notation we had last time.

Similarly, we had machine learning of positive or negative comments. In this case the sentiment was the Y, and the set of all words formed our X, which is an extremely high-dimensional space. A word can either be present or absent in a comment, so all of these are still binary variables.

We turn to another example now. Imagine a baby observing various animals and wanting to figure out which animal has which name. We have various features: the size of the animal, the size of its head, the noise the animal makes, the number of legs it has, and the animal itself. The baby observes many instances and somehow is able to discern these features, in terms of whether the size of the head is large, small, or medium, whether the animal itself is large or small, the noise that the animal makes, etc. The machine learning task is then to classify a new animal into the appropriate category: lion, cat, elephant, etc.
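Here is the promised sketch of the "project the space, then separate with a line or a plane" idea. It is a hand-crafted illustration, not how a support vector machine is actually trained: the data are synthetic, and the boundary is chosen by inspection rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D points: label 1 if the point lies inside the unit circle, 0 otherwise.
# No straight line in the original (x1, x2) plane separates the two classes.
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

# Project each point into a new space using squared coordinates.
Z = X**2                      # z1 = x1^2, z2 = x2^2

# In the projected space the classes ARE linearly separable:
# the straight line z1 + z2 = 1 does it.
y_hat = (Z[:, 0] + Z[:, 1] < 1.0).astype(int)
print((y_hat == y).mean())    # 1.0 -- the linear boundary in z-space is exact
```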
Returning to the animal example: in (Y, X), the animal itself is Y, and the size of the animal, its head size, its noise, and its number of legs are the features; this is a fixed set. Here all four features are multi-valued categorical variables, so they take values in specific categories. They are not real numbers, for example; each takes one of a few categories, such as small, medium, or large.

And lastly we consider another example, where we have customers going to supermarkets and buying bunches of products together in transactions. A transaction might consist of milk, diapers, and some cola; another one might consist of some diapers and beer; yet more transactions will have different products. Here, interestingly, we have no output variable, only features, which are just the items that people buy. The items are again multi-valued categorical variables, but they form a variable-sized set, just like the comments that we had earlier: we could either consider the set of all possible items and have binary variables, or we could have a variable number of features that are multi-valued categorical variables.

These examples are all cases where one could do various kinds of machine learning. In the first three examples, that is queries, comments, and animals, there was a clear output variable, which indicated the class of the particular instance. So one could imagine a supervised machine learning scenario, using something like a naive Bayes classifier, to compute the likelihoods and estimate the a posteriori probability or, as we have just seen, the expected value of the class given X. In the last example, transactions, there was no output variable, so the task there is a little bit different, which we shall come to very soon.

Do go back over the formalism, and make sure that you're able to figure out how classification happens in each of these cases, that is, exactly how the Y and the X are formulated, so that we get a formal representation of the problem in terms of features and an output variable. In this particular case, of course, there is no output variable, as we have already mentioned; let's now get into the reason for that.
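Purely as an aid to that exercise, here is one possible way the (Y, X) formulations for the animal and transaction examples might be written down. The feature names and values are made up for illustration, not data from the lecture.

```python
# Animal example: a fixed set of multi-valued categorical features X,
# plus an output variable Y (the animal's name) -- a supervised setting.
animal_instance = {
    "X": {"size": "large", "head_size": "medium", "noise": "roar", "legs": 4},
    "Y": "lion",
}

# Transaction example: a variable-sized set of categorical items and
# no output variable at all -- so there is no class to predict directly.
transaction = {"X": {"milk", "diapers", "cola"}, "Y": None}

# Alternatively, fix the universe of items and use binary presence/absence
# features, just as we did with words in comments.
ITEM_UNIVERSE = ["milk", "diapers", "cola", "beer"]
x_binary = [1 if item in transaction["X"] else 0 for item in ITEM_UNIVERSE]
print(x_binary)   # [1, 1, 1, 0]
```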