In the previous video, we gave the mathematical definition of the cost function. In this video, let's look at some examples to get back to intuition about what the cost function is doing, and why we want to use it. To recap, here's what we had last time. We want to fit a straight line to our data, so we had this form of a hypothesis with the parameters theta zero and theta one, and with different choices of the parameters we end up with different straight-line fits to the data, like so. And we had a cost function, and that was our optimization objective.

In this video, in order to better visualize the cost function J, I'm going to work with a simplified hypothesis function, like that shown on the right. So I'm going to use my simplified hypothesis, which is just theta one times x. We can, if you want, think of this as setting the parameter theta zero equal to zero. So I have only one parameter, theta one, and my cost function is similar to before, except that now h of x is just equal to theta one times x. Because I have only one parameter, theta one, my optimization objective is to minimize J of theta one. In pictures, what this means is that if theta zero equals zero, that corresponds to choosing only hypothesis functions that pass through the origin, that pass through the point (0, 0).

Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better. It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. So notice that the hypothesis, h of x, for a fixed value of theta one, is a function of x. So the hypothesis is a function of x, the size of the house. In contrast, the cost function J is a function of the parameter theta one, which controls the slope of the straight line. Let's plot these functions and try to understand them both better.

Let's start with the hypothesis. On the left, let's say here's my training set with three points at (1, 1), (2, 2), and (3, 3). Let's pick a value for theta one. So when theta one equals one, if that's my choice for theta one, then my hypothesis is going to look like this straight line over here. And I'll point out that when I'm plotting my hypothesis function, my horizontal axis is labeled x, the size of the house.
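To make these two functions concrete, here is a minimal Python sketch of the simplified hypothesis and the cost function on the lecture's toy training set. The names h, J, xs, and ys are illustrative choices for this sketch, not anything defined in the course.

```python
# Toy training set used in the lecture: (1, 1), (2, 2), (3, 3)
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def h(theta1, x):
    """Simplified hypothesis with theta_0 fixed at 0: h(x) = theta_1 * x."""
    return theta1 * x

def J(theta1):
    """Squared-error cost: J(theta_1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(xs)
    return sum((h(theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
```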
Now, let's temporarily set theta one equal to one. What I want to do is figure out what J of theta one is when theta one equals one. So let's go ahead and compute what the cost function equals for the value one. Well, as usual, my cost function is defined as follows: one over 2m times the sum over my training set of this usual squared error term, that is, the sum of (theta one times x(i) minus y(i)) squared. If you simplify, this turns out to be zero squared plus zero squared plus zero squared, which is of course just equal to zero. Inside the cost function, it turns out each of these terms is equal to zero, because for the specific training set I have, my three training examples are (1, 1), (2, 2), and (3, 3), and if theta one is equal to one, then h of x(i) is exactly equal to y(i). And so h of x(i) minus y(i), each of these terms, is equal to zero, which is why I find that J of one is equal to zero.

So we now know that J of one is equal to zero. Let's plot that. What I'm going to do on the right is plot my cost function J. And notice, because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So I have J of one equal to zero; let's go ahead and plot that. We end up with an X over there.

Now let's look at some other examples. Theta one can take on a range of different values, right? Theta one can take on negative values, zero, and positive values. So what if theta one is equal to 0.5? What happens then? Let's go ahead and plot that. I'm now going to set theta one equal to 0.5, and in that case my hypothesis now looks like this: a line with slope equal to 0.5. And let's compute J of 0.5. So that is going to be one over 2m times my usual cost function. It turns out that the cost function is going to be the square of the height of this vertical line, plus the square of the height of that one, plus the square of the height of that one, because each vertical distance is the difference between y(i) and the predicted value h of x(i). So the first term is going to be (0.5 minus 1) squared, because my hypothesis predicted 0.5, whereas the actual value was one. For my second example, I get (1 minus 2) squared, because my hypothesis predicted one, but the actual housing price was two. And then finally, plus (1.5 minus 3) squared. And so that's equal to one over (two times three).
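Assuming the sketch above, the same spot checks the lecture just did by hand can be reproduced like so (this continues the earlier illustrative code; it is not course-provided material).

```python
# Continuing the sketch above: reproduce the lecture's hand calculations.
print(J(1.0))   # 0.0 -> with theta_1 = 1 the line passes through every point
print(J(0.5))   # ((0.5-1)^2 + (1-2)^2 + (1.5-3)^2) / 6 = 3.5 / 6 ≈ 0.58
```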
Because m, the training set size, is three, since I have three training examples. Simplifying what's inside the parentheses, that's 3.5, so we get 3.5 over six, which is about 0.58. So now we know that J of 0.5 is about 0.58. Let's plot that, which is maybe about over there. Okay?

Now, let's do one more. How about if theta one is equal to zero? What is J of zero equal to? It turns out that if theta one is equal to zero, then h of x is just equal to this flat line that runs horizontally like this. And so, measuring the errors, we have that J of zero is equal to one over 2m times (one squared plus two squared plus three squared), which is one sixth times fourteen, which is about 2.3. So let's go ahead and plot that as well. It ends up with a value around 2.3, and of course we can keep on doing this for other values of theta one. It turns out that you can have negative values of theta one as well, so if theta one is negative, then h of x would be equal to, say, minus 0.5 times x. Then theta one is minus 0.5, and that corresponds to a hypothesis with a slope of negative 0.5. And you can actually keep on computing these errors. It turns out that for minus 0.5 the error is really high; it works out to be something like 5.25. And so on, for the different values of theta one, you can compute these things, and you get a set of points something like that. By computing this range of values, you can slowly trace out what the function J of theta looks like, and that's what J of theta is.

To recap, each value of theta one corresponds to a different hypothesis, or to a different straight-line fit on the left. And for each value of theta one, we can then derive a different value of J of theta one. For example, theta one equals one corresponded to this straight line right through the data, whereas theta one equals 0.5, this point shown in magenta, corresponded to maybe that line, and theta one equals zero, which is shown in blue, corresponds to this horizontal line. So for each value of theta one, we wound up with a different value of J of theta one, and we could then use this to trace out the plot on the right. Now, remember, the optimization objective for our learning algorithm is that we want to choose the value of theta one that minimizes J of theta one.
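Continuing the same sketch, sweeping theta one over a handful of values traces out the bowl-shaped curve the lecture draws on the right-hand plot. The particular grid of slopes below is just one illustrative choice.

```python
# Continuing the sketch above: trace out J(theta_1) point by point,
# the way the lecture builds up the right-hand plot.
for theta1 in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta_1 = {theta1:4.1f}  ->  J = {J(theta1):.2f}")
# Expected values: 5.25, 2.33, 0.58, 0.00, 0.58, 2.33 --
# a bowl shape that is lowest at theta_1 = 1.
```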
Right? This was our objective function for linear regression. Well, looking at this curve, the value that minimizes J of theta one is theta one equal to one. And lo and behold, that is indeed the best possible straight-line fit through our data: by setting theta one equal to one, for this particular training set, we actually end up fitting it perfectly. And that's why minimizing J of theta one corresponds to finding a straight line that fits the data well.

So, to wrap up: in this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter, theta one, by setting the parameter theta zero to zero. In the next video, we'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one, that is, without setting theta zero to zero. And hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression formulation.
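As a final illustrative check with the same sketch, a brute-force search over a grid of theta one values lands on theta one equal to one, the minimizer the lecture reads off the curve. This is only for intuition in the one-parameter case; the general-purpose minimization method used in the course, gradient descent, comes later.

```python
# Continuing the sketch above: the optimization objective in miniature.
# A simple grid search is enough for this one-parameter toy example.
candidates = [i / 100 for i in range(-100, 301)]   # theta_1 from -1.0 to 3.0
best = min(candidates, key=J)
print(best, J(best))   # -> 1.0 0.0: the perfect fit for this training set
```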