1 00:00:00,210 --> 00:00:02,931 Let's start talking about logistic regression. 2 00:00:02,950 --> 00:00:04,315 In this video, I'd like to 3 00:00:04,315 --> 00:00:07,210 show you the hypothesis representation, that 4 00:00:07,210 --> 00:00:08,805 is, what is the function 5 00:00:08,810 --> 00:00:10,266 we're going to use to represent 6 00:00:10,300 --> 00:00:15,446 our hypothesis where we have a classification problem. 7 00:00:15,450 --> 00:00:16,969 Earlier, we said that we 8 00:00:16,969 --> 00:00:20,426 would like our classifier to 9 00:00:20,426 --> 00:00:21,956 output values that are between 10 00:00:21,956 --> 00:00:23,250 zero and one. So, we 11 00:00:23,270 --> 00:00:24,566 like to come up with a 12 00:00:24,566 --> 00:00:26,385 hypothesis that satisfies this 13 00:00:26,385 --> 00:00:30,396 property, that these predictions are maybe between zero and one. 14 00:00:30,396 --> 00:00:32,764 When we were using linear regression, 15 00:00:32,764 --> 00:00:34,262 this was the form of a 16 00:00:34,262 --> 00:00:35,604 hypothesis, where H of X 17 00:00:35,604 --> 00:00:38,319 is theta transpose X. For 18 00:00:38,330 --> 00:00:39,831 logistic regression, I'm going 19 00:00:39,831 --> 00:00:41,075 to modify this a little 20 00:00:41,075 --> 00:00:43,352 bit, and make the hypothesis 21 00:00:43,360 --> 00:00:46,218 G of theta transpose X, 22 00:00:46,218 --> 00:00:47,711 where I'm going to define 23 00:00:47,711 --> 00:00:50,693 the function G as follows: 24 00:00:50,693 --> 00:00:51,926 G of Z if Z 25 00:00:51,926 --> 00:00:53,633 is a real number is equal 26 00:00:53,640 --> 00:00:55,640 to one over one plus 27 00:00:55,640 --> 00:00:58,480 E to the negative Z. This 28 00:00:58,490 --> 00:01:01,716 called the sigmoid function 29 00:01:01,720 --> 00:01:04,843 or the logistic function. 30 00:01:04,843 --> 00:01:07,089 And the term logistic function, 31 00:01:07,120 --> 00:01:11,103 that's what give rise to the name logistic progression. 32 00:01:11,103 --> 00:01:12,781 And, by the way, the terms 33 00:01:12,781 --> 00:01:14,551 sigmoid function and logistic 34 00:01:14,551 --> 00:01:16,996 function are basically synonyms 35 00:01:16,996 --> 00:01:18,362 and mean the same thing. 36 00:01:18,362 --> 00:01:19,756 So the two terms are 37 00:01:19,756 --> 00:01:21,893 basically interchangeable and either 38 00:01:21,893 --> 00:01:23,160 term can be used to 39 00:01:23,160 --> 00:01:24,620 refer to this function 40 00:01:24,620 --> 00:01:26,283 G. And if we 41 00:01:26,283 --> 00:01:27,734 take these two equations, and 42 00:01:27,734 --> 00:01:30,089 put them together, then here's 43 00:01:30,089 --> 00:01:32,354 just an alternative way of 44 00:01:32,354 --> 00:01:34,843 writing out the form of my hypothesis. 45 00:01:34,843 --> 00:01:36,533 I'm saying that H of x 46 00:01:36,540 --> 00:01:38,933 is one over one plus 47 00:01:38,933 --> 00:01:41,765 E to the negative theta transpose 48 00:01:41,765 --> 00:01:43,106 X, and all I've done is 49 00:01:43,106 --> 00:01:45,353 I've taken the variable 50 00:01:45,353 --> 00:01:46,700 Z, Z here's a 51 00:01:46,760 --> 00:01:48,173 real number and plugged in 52 00:01:48,173 --> 00:01:50,201 theta transpose X, so 53 00:01:50,201 --> 00:01:52,560 I end up with, you know, theta transpose 54 00:01:52,560 --> 00:01:54,933 X, in place of Z there. 55 00:01:54,940 --> 00:01:57,949 Lastly, let me show you where the sigmoid function looks like. 56 00:01:57,949 --> 00:02:00,296 We're going to plot it on this figure here. 57 00:02:00,296 --> 00:02:02,022 The sigmoid function, G of 58 00:02:02,022 --> 00:02:04,652 Z, also called the logistic function, looks like this. 59 00:02:04,652 --> 00:02:07,078 It starts off near zero and 60 00:02:07,078 --> 00:02:09,366 then rises until it processes 61 00:02:09,366 --> 00:02:13,473 0.5 at the origin and then it flattens out again like so. 62 00:02:13,500 --> 00:02:16,051 So that's what the sigmoid function looks like. 63 00:02:16,051 --> 00:02:17,898 And you notice that the 64 00:02:17,898 --> 00:02:19,725 sigmoid function, well, it 65 00:02:19,740 --> 00:02:21,894 asymptotes at one, and 66 00:02:21,894 --> 00:02:24,256 asymptotes at zero as 67 00:02:24,256 --> 00:02:26,388 Z against the horizontal axis 68 00:02:26,388 --> 00:02:27,659 is Z. As Z goes to 69 00:02:27,659 --> 00:02:29,304 minus infinity, G of 70 00:02:29,304 --> 00:02:31,396 Z approaches zero and as 71 00:02:31,396 --> 00:02:33,816 G of Z approaches infinity, G 72 00:02:33,816 --> 00:02:35,864 of Z approaches 1, and 73 00:02:35,880 --> 00:02:37,252 so because G of 74 00:02:37,252 --> 00:02:39,408 Z offers values that are 75 00:02:39,408 --> 00:02:41,696 between 0 and 1 we 76 00:02:41,730 --> 00:02:44,592 also have that H of 77 00:02:44,610 --> 00:02:47,141 X must be between 0 and 1. 78 00:02:47,141 --> 00:02:50,029 Finally, given this hypothesis 79 00:02:50,040 --> 00:02:52,123 representation, what we 80 00:02:52,123 --> 00:02:53,740 need to do, as before, 81 00:02:53,740 --> 00:02:58,841 is fit the parameters theta to our data. 82 00:02:58,841 --> 00:03:00,490 So given a training set, we 83 00:03:00,490 --> 00:03:01,743 need to pick a value for 84 00:03:01,743 --> 00:03:03,773 the parameters theta and this 85 00:03:03,773 --> 00:03:06,981 hypothesis will then let us make predictions. 86 00:03:06,981 --> 00:03:08,534 We'll talk about a learning algorithm 87 00:03:08,534 --> 00:03:11,828 later for fitting the parameters theta. 88 00:03:11,828 --> 00:03:13,506 But first let's talk a 89 00:03:13,506 --> 00:03:17,379 bit about the interpretation of this model. 90 00:03:17,640 --> 00:03:19,612 Here's how I'm going to 91 00:03:19,620 --> 00:03:21,660 interpret the output of 92 00:03:21,660 --> 00:03:23,637 my hypothesis H of 93 00:03:23,637 --> 00:03:26,387 X. When my hypothesis 94 00:03:26,400 --> 00:03:28,238 outputs some number, I am 95 00:03:28,240 --> 00:03:30,126 going to treat that number as 96 00:03:30,126 --> 00:03:33,400 the estimated probability that Y 97 00:03:33,400 --> 00:03:35,170 is equal to one on a 98 00:03:35,170 --> 00:03:38,266 new input example X. Here is what I mean. 99 00:03:38,266 --> 00:03:40,324 Here is an example. 100 00:03:40,324 --> 00:03:43,932 Let's say we're using the tumor classification example. 101 00:03:43,932 --> 00:03:45,234 So we may have a feature 102 00:03:45,234 --> 00:03:47,945 vector X, which is this x01 103 00:03:47,945 --> 00:03:49,860 as always and then our 104 00:03:49,860 --> 00:03:52,836 one feature is the size of the tumor. 105 00:03:52,836 --> 00:03:54,045 Suppose I have a patient come 106 00:03:54,045 --> 00:03:55,459 in and, you know they have some 107 00:03:55,459 --> 00:03:57,183 tumor size and I 108 00:03:57,183 --> 00:03:58,759 feed their feature vector X 109 00:03:58,759 --> 00:04:00,963 into my hypothesis and suppose 110 00:04:00,970 --> 00:04:03,760 my hypothesis outputs the number 0.7. 111 00:04:03,760 --> 00:04:05,758 I'm going to interpret 112 00:04:05,758 --> 00:04:07,298 my hypothesis as follows. 113 00:04:07,298 --> 00:04:08,790 I'm going to say that this 114 00:04:08,790 --> 00:04:10,235 hypothesis is telling me 115 00:04:10,235 --> 00:04:12,143 that for a patient with 116 00:04:12,143 --> 00:04:14,490 features X, the probability 117 00:04:14,520 --> 00:04:16,772 that Y equals one is 0 .7. 118 00:04:16,772 --> 00:04:18,703 In other words, I'm going 119 00:04:18,720 --> 00:04:21,106 to tell my patient that the 120 00:04:21,106 --> 00:04:23,320 tumor, sadly, has 121 00:04:23,320 --> 00:04:27,836 a 70% chance or a 0.7 chance of being malignant. 122 00:04:27,860 --> 00:04:29,420 To write this out slightly more 123 00:04:29,420 --> 00:04:30,473 formally or to write this 124 00:04:30,480 --> 00:04:31,763 out in math, I'm going to 125 00:04:31,763 --> 00:04:34,803 interpret my hypothesis output 126 00:04:34,820 --> 00:04:37,144 as P of y 127 00:04:37,150 --> 00:04:39,913 equals 1, given X 128 00:04:39,913 --> 00:04:41,813 parametrized by theta. 129 00:04:41,830 --> 00:04:43,389 So, for those of you that are 130 00:04:43,389 --> 00:04:45,320 familiar with probability, this equation 131 00:04:45,320 --> 00:04:46,766 might make sense, if you're a little less familiar 132 00:04:46,766 --> 00:04:48,673 with probability, you know, here's 133 00:04:48,673 --> 00:04:51,564 how I read this expression, this 134 00:04:51,580 --> 00:04:53,215 is the probability that y is 135 00:04:53,215 --> 00:04:54,988 equals to one, given x 136 00:04:54,988 --> 00:04:56,493 instead of given that my patient 137 00:04:56,493 --> 00:04:58,027 has, you know, features X. 138 00:04:58,040 --> 00:04:59,860 Given my patient has a particular 139 00:04:59,860 --> 00:05:01,575 tumor size represented by my 140 00:05:01,575 --> 00:05:03,156 features X, and this 141 00:05:03,156 --> 00:05:06,956 probability is parametrized by theta. 142 00:05:07,130 --> 00:05:09,166 So I'm basically going to count 143 00:05:09,166 --> 00:05:11,009 on my hypothesis to give 144 00:05:11,009 --> 00:05:13,332 me estimates of the probability 145 00:05:13,332 --> 00:05:15,349 that Y is equal to 1. 146 00:05:15,349 --> 00:05:16,523 Now since this is a 147 00:05:16,523 --> 00:05:18,629 classification task, we know 148 00:05:18,640 --> 00:05:21,497 that Y must be either zero or one, right? 149 00:05:21,497 --> 00:05:23,373 Those are the only two values 150 00:05:23,390 --> 00:05:25,466 that Y could possibly take on, 151 00:05:25,466 --> 00:05:26,654 either in the training set or 152 00:05:26,654 --> 00:05:28,077 for new patients that may walk 153 00:05:28,077 --> 00:05:32,014 into my office or into the doctor's office in the future. 154 00:05:32,014 --> 00:05:33,529 So given H of X, 155 00:05:33,550 --> 00:05:36,153 we can therefore compute the probability 156 00:05:36,153 --> 00:05:39,116 that Y is equal to zero as well. 157 00:05:39,116 --> 00:05:41,209 Concretely, because Y must 158 00:05:41,250 --> 00:05:43,065 be either zero or one, 159 00:05:43,070 --> 00:05:45,141 we know that the probability 160 00:05:45,141 --> 00:05:46,329 of Y equals zero, plus the 161 00:05:46,329 --> 00:05:47,512 probability of Y equals 162 00:05:47,550 --> 00:05:50,173 one, must add up to one. 163 00:05:50,173 --> 00:05:51,483 This first equation looks a 164 00:05:51,483 --> 00:05:52,828 little bit more complicated but it's 165 00:05:52,828 --> 00:05:54,603 basically saying that probability of 166 00:05:54,610 --> 00:05:56,287 Y equals zero for a 167 00:05:56,320 --> 00:05:58,319 particular patient with features x, and 168 00:05:58,360 --> 00:06:01,002 you know, given our parameter's theta, plus the 169 00:06:01,010 --> 00:06:02,305 probability of Y equals one for 170 00:06:02,305 --> 00:06:04,470 that same patient which features x and you 171 00:06:04,471 --> 00:06:06,334 parameters theta must add 172 00:06:06,360 --> 00:06:08,260 up to one, if this equation 173 00:06:08,260 --> 00:06:10,171 looks a little bit complicated feel free 174 00:06:10,200 --> 00:06:14,049 to mentally imagine it without that X and theta. 175 00:06:14,049 --> 00:06:15,476 And this is just saying that 176 00:06:15,480 --> 00:06:16,993 the probability of Y equals zero plus 177 00:06:16,993 --> 00:06:19,272 the probability of Y equals one must be equal to one. 178 00:06:19,280 --> 00:06:20,365 And we know this to be 179 00:06:20,365 --> 00:06:23,120 true because Y has to be either zero or one. 180 00:06:23,120 --> 00:06:24,240 And so the chance of Y 181 00:06:24,240 --> 00:06:25,918 being zero plus the chance 182 00:06:25,930 --> 00:06:29,547 that Y is one, you know, those two must add up to one. 183 00:06:29,547 --> 00:06:31,387 And so if you just 184 00:06:31,440 --> 00:06:33,780 take this term and move 185 00:06:33,780 --> 00:06:35,409 it to the right-hand side, then 186 00:06:35,409 --> 00:06:37,327 you end up with this equation 187 00:06:37,327 --> 00:06:38,995 that says probability Y equals zero 188 00:06:38,995 --> 00:06:40,502 is one minus probability y equals 189 00:06:40,530 --> 00:06:43,548 and thus if our 190 00:06:43,560 --> 00:06:46,009 hypothesis if H of X 191 00:06:46,009 --> 00:06:47,775 gives us that term you can 192 00:06:47,790 --> 00:06:49,948 therefore quite simply compute the 193 00:06:49,948 --> 00:06:51,508 probability, or compute the 194 00:06:51,510 --> 00:06:53,282 estimated probability that Y 195 00:06:53,282 --> 00:06:55,411 is equal to zero as well. 196 00:06:55,411 --> 00:06:56,720 So you now know what 197 00:06:56,720 --> 00:06:59,779 the hypothesis representation is for 198 00:06:59,790 --> 00:07:01,576 logistic regression and we're seeing 199 00:07:01,580 --> 00:07:03,534 what the mathematical formula is 200 00:07:03,534 --> 00:07:06,701 defining the hypothesis for logistic regression. 201 00:07:06,701 --> 00:07:07,880 In the next video, I'd like 202 00:07:07,880 --> 00:07:09,018 to try to give you 203 00:07:09,040 --> 00:07:11,091 better intuition about what the 204 00:07:11,091 --> 00:07:12,518 hypothesis function looks like. 205 00:07:12,518 --> 00:07:13,606 And I want to tell 206 00:07:13,620 --> 00:07:15,294 you something called the decision 207 00:07:15,294 --> 00:07:16,700 boundary and we'll look 208 00:07:16,700 --> 00:07:18,846 at some visualizations together to 209 00:07:18,846 --> 00:07:20,186 try to get a better sense 210 00:07:20,186 --> 00:07:22,370 of what this hypothesis function of 211 00:07:22,370 --> 00:07:24,697 logistic regression really looks like.