Neural networks are one of the most powerful learning algorithms that we have today. In this, and in the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network. I'm going to focus on the application of neural networks to classification problems.

So, suppose we have a network like that shown on the left, and suppose we have a training set like this of (x^(i), y^(i)) pairs of m training examples. I'm going to use capital L to denote the total number of layers in this network. So, for the network shown on the left, we would have capital L equals 4. And I'm going to use s subscript l to denote the number of units, that is, the number of neurons, not counting the bias unit, in layer l of the network. So, for example, we would have s1, which is the input layer, equal to 3 units; s2 in my example is 5 units; and the output layer s4, which also equals sL because capital L is equal to 4, in my example on the left has 4 units.

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either zero or one. In this case, we would have one output unit. So, this neural network on top has four output units, but if we had binary classification, we would have only one output unit that computes h(x). And the output of the neural network, h(x), is going to be a real number.
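To summarize the notation so far (the layer sizes here are read off the example network on the slide):

```latex
L = \text{total number of layers in the network}, \qquad
s_l = \text{number of units in layer } l \text{ (not counting the bias unit)}
```

For the example network: L = 4, s_1 = 3, s_2 = 5, and s_4 = s_L = 4.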
And in this case the number of output units, sL, where L is again the index of the final layer, because that's the number of layers we have in the network, so the number of units we have in the output layer, is going to be equal to 1. In this case, to simplify notation later, I'm also going to set K equals 1. So, you can think of K as also denoting the number of units in the output layer.

The second type of classification problem we'll consider is the multiclass classification problem, where we may have K distinct classes. So, in our earlier example, I had this representation for y if we have four classes, and in this case we would have capital K output units, and our hypothesis will output vectors that are K-dimensional. And the number of output units will be equal to K. And usually we will have K greater than or equal to 3 in this case, because if we had two classes, then we don't need to use the one-versus-all method. We need to use the one-versus-all method only if we have K greater than or equal to 3 classes; if we have only two classes, we need only one output unit.

Now, let's define the cost function for our neural network. The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression, we used to minimize the cost function J(theta), which was minus 1 over m of this cost function, and then plus this extra regularization term here, where this was a sum from j equals 1 through n, because we did not regularize the bias term theta 0. For a neural network, our cost function is going to be a generalization of this.
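For reference, the regularized logistic regression cost just described can be written out as:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)})
          + (1 - y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big) \Big]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
```

Note that the regularization sum starts at j = 1, leaving the bias term theta_0 unpenalized, exactly as the lecture says.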
Instead of having basically just one logistic regression output unit, we may instead have K of them. So here's our cost function. The neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use this notation, (h(x)) subscript i, to denote the ith output. That is, h(x) is a K-dimensional vector, and so this subscript i just selects out the ith element of the vector that is output by my neural network.

My cost function, J(Theta), is now going to be the following: minus 1 over m of a sum of a term similar to what we have in logistic regression, except that we have this sum from k equals 1 through K. The summation is basically a sum over my K output units. So, if I have four output units, that is, if the final layer of my neural network has four output units, then this is a sum from k equals 1 through 4 of basically the logistic regression cost function, summing that cost function over each of my four output units in turn. And so, you notice in particular that this applies to y_k and h_k, because we're basically taking the kth output unit and comparing that to the value of y_k, which is that element of one of those vectors saying which class it should be.

And finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over these terms, Theta_ji^(l), for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, like we had for logistic regression.
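Written out, the neural network cost function being described is:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
            \Big[\, y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k
          + (1 - y_k^{(i)})\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
          + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}
            \big(\Theta_{ji}^{(l)}\big)^2
```

The inner sum over k runs the logistic cost over each of the K output units; the triple sum in the regularization term covers every weight in every layer, with i starting at 1 so the bias weights are skipped.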
Concretely, we don't sum over the terms corresponding to where i is equal to zero. That is because when we are computing the activation of a neuron, we have terms like Theta_i0 plus Theta_i1 x1 plus, and so on, where I guess we could have a superscript 2 there if this is the first hidden layer. And so the values with the 0 there correspond to something that multiplies into an x0 or an a0, and so this is kind of like a bias unit. By analogy to what we were doing for logistic regression, we won't sum over those terms in our regularization term, because we don't want to regularize them and shrink their values toward 0.

But this is just one possible convention, and even if you were to sum over i equals 0 up to sl, it would work about the same, and it doesn't make a big difference. But this convention of not regularizing the bias term is perhaps slightly more common.

So, that's the cost function we're going to use to fit our neural network. In the next video, we'll start to talk about an algorithm for trying to optimize the cost function.
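As an illustration, here is a minimal NumPy sketch of computing this cost. The function name, the layout of the weight matrices (bias column first), and the sigmoid forward pass are assumptions for the sketch, not something specified in the video:

```python
import numpy as np

def nn_cost(Theta_list, X, Y, lam):
    """Regularized neural-network cost J(Theta) -- a minimal sketch.

    Assumptions (illustrative, not from the lecture):
      Theta_list : list of weight matrices; Theta_list[l] has shape
                   (s_{l+1}, s_l + 1), with the bias column first.
      X : (m, n) inputs; Y : (m, K) labels as one-hot rows (K = 1 for binary).
      lam : regularization parameter lambda.
    """
    m = X.shape[0]
    A = X
    for Theta in Theta_list:                    # forward propagation
        A = np.hstack([np.ones((m, 1)), A])     # prepend bias unit a0 = 1
        A = 1.0 / (1.0 + np.exp(-A @ Theta.T))  # sigmoid activation
    H = A                                       # (m, K) hypothesis outputs

    # Sum the logistic cost over all m examples and all K output units.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Regularize every weight except the bias column (i = 0).
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Theta_list)
    return cost + lam / (2 * m) * reg
```

Skipping `Theta[:, 0]` in the penalty mirrors the convention above of not regularizing the bias weights.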