1 00:00:00,210 --> 00:00:01,300 In the last video, I talked 2 00:00:01,600 --> 00:00:03,390 about how when faced with 3 00:00:03,520 --> 00:00:04,780 a machine learning problem, there are 4 00:00:04,980 --> 00:00:07,260 often lots of different ideas on how to improve the algorithm. 5 00:00:08,460 --> 00:00:09,510 In this video let's 6 00:00:09,650 --> 00:00:11,060 talk about the concept of error 7 00:00:11,330 --> 00:00:12,980 analysis, which will help 8 00:00:13,070 --> 00:00:13,980 me give you a way to more 9 00:00:14,300 --> 00:00:15,830 systematically make some of these decisions. 10 00:00:18,070 --> 00:00:19,420 If you're starting work on a 11 00:00:19,540 --> 00:00:21,210 machine learning product or building 12 00:00:21,400 --> 00:00:23,340 a machine learning application, it is 13 00:00:23,480 --> 00:00:24,880 often considered very good practice 14 00:00:25,840 --> 00:00:27,000 to start, not by building 15 00:00:27,520 --> 00:00:29,070 a very complicated system with 16 00:00:29,220 --> 00:00:30,490 lots of complex features and so 17 00:00:30,930 --> 00:00:32,450 on, but to instead start 18 00:00:33,060 --> 00:00:34,120 by building a very simple 19 00:00:34,510 --> 00:00:35,760 algorithm that you can implement quickly. 20 00:00:37,480 --> 00:00:38,610 And when I start on 21 00:00:38,740 --> 00:00:39,770 a learning problem, what I usually 22 00:00:40,150 --> 00:00:41,350 do is spend at most one 23 00:00:41,570 --> 00:00:43,160 day, literally at most 24 24 00:00:43,460 --> 00:00:46,030 hours to try to get something really quick and dirty. 25 00:00:47,040 --> 00:00:48,550 Frankly, not at all a sophisticated system. 26 00:00:49,370 --> 00:00:50,310 But get something really quick and 27 00:00:50,400 --> 00:00:52,080 dirty running and implement 28 00:00:52,590 --> 00:00:53,710 it and then test it on 29 00:00:53,880 --> 00:00:55,870 my cross validation data.
Once 30 00:00:56,050 --> 00:00:57,140 you've done that, you can 31 00:00:57,480 --> 00:00:58,690 then plot learning curves. 32 00:00:59,960 --> 00:01:02,670 This is what we talked about in the previous set of videos. 33 00:01:03,230 --> 00:01:05,160 But plot learning curves of the 34 00:01:05,370 --> 00:01:07,120 training and test errors to 35 00:01:07,310 --> 00:01:08,280 try to figure out if your 36 00:01:08,400 --> 00:01:09,630 learning algorithm may be suffering 37 00:01:10,120 --> 00:01:11,240 from high bias or high 38 00:01:11,440 --> 00:01:13,180 variance or something else and 39 00:01:13,440 --> 00:01:14,380 use that to try to 40 00:01:14,490 --> 00:01:15,610 decide if having more data 41 00:01:16,080 --> 00:01:17,990 and more features and so on are likely to help. 42 00:01:18,670 --> 00:01:19,830 And the reason that this 43 00:01:20,000 --> 00:01:20,980 is a good approach is often 44 00:01:21,940 --> 00:01:22,980 when you're just starting out on 45 00:01:23,100 --> 00:01:24,460 a learning problem, there's really no 46 00:01:24,680 --> 00:01:25,820 way to tell in advance 47 00:01:26,480 --> 00:01:27,360 whether you need more complex 48 00:01:27,790 --> 00:01:29,200 features or whether you need 49 00:01:29,250 --> 00:01:30,950 more data or something else. 50 00:01:31,280 --> 00:01:32,270 And it's just very hard to tell 51 00:01:32,510 --> 00:01:33,840 in advance, that is in 52 00:01:33,970 --> 00:01:36,040 the absence of evidence, in 53 00:01:36,160 --> 00:01:37,840 the absence of seeing a 54 00:01:37,970 --> 00:01:39,130 learning curve, it's just incredibly 55 00:01:39,750 --> 00:01:42,860 difficult to figure out where you should be spending your time. 56 00:01:43,760 --> 00:01:45,360 And it's often by implementing even 57 00:01:45,730 --> 00:01:46,670 a very, very quick and dirty 58 00:01:46,980 --> 00:01:48,100 implementation and by plotting 59 00:01:48,540 --> 00:01:51,070 learning curves that that helps you make these decisions. 
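As a rough illustration of the learning-curve step just described, here is a minimal sketch. It assumes a one-variable least-squares line as a stand-in for the quick and dirty model and uses synthetic data; the names `fit_line`, `mse`, and `learning_curve` are hypothetical and not from the lecture.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = cov_xy / var_x
    return mean_y - b * mean_x, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, cv, sizes):
    """For each m in sizes, fit on the first m training pairs and
    return (m, training error, cross-validation error)."""
    cv_x, cv_y = zip(*cv)
    rows = []
    for m in sizes:
        xs, ys = zip(*train[:m])
        model = fit_line(xs, ys)
        rows.append((m, mse(model, xs, ys), mse(model, cv_x, cv_y)))
    return rows

# Synthetic data, purely for illustration.
random.seed(0)
points = []
for _ in range(200):
    x = random.uniform(-3, 3)
    points.append((x, 1.0 + 2.0 * x + random.gauss(0, 0.5)))

curve = learning_curve(points[:150], points[150:], sizes=[2, 10, 50, 150])
for m, train_err, cv_err in curve:
    print(f"m={m:4d}  train={train_err:.3f}  cv={cv_err:.3f}")
```

Reading the two error columns side by side as m grows is what suggests whether more data is likely to help: a large gap that keeps shrinking points toward high variance, while two errors that are close together but both high point toward high bias.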
60 00:01:52,580 --> 00:01:53,340 So if you like, you can think 61 00:01:53,560 --> 00:01:54,490 of this as a way of 62 00:01:54,620 --> 00:01:56,270 avoiding what's sometimes called 63 00:01:56,570 --> 00:01:58,950 premature optimization in computer programming. 64 00:02:00,000 --> 00:02:01,070 And this is the idea that just 65 00:02:01,200 --> 00:02:03,130 says that we should let 66 00:02:03,460 --> 00:02:04,920 evidence guide our decisions 67 00:02:05,650 --> 00:02:06,540 on where to spend our time 68 00:02:07,160 --> 00:02:08,150 rather than use gut feeling, 69 00:02:09,070 --> 00:02:09,680 which is often wrong. 70 00:02:10,930 --> 00:02:12,120 In addition to plotting learning 71 00:02:12,390 --> 00:02:13,540 curves, one other thing 72 00:02:13,810 --> 00:02:16,440 that's often very useful to do is what's called error analysis. 73 00:02:18,120 --> 00:02:19,080 And what I mean by that is 74 00:02:19,280 --> 00:02:20,520 that when building, say, 75 00:02:20,770 --> 00:02:22,190 a spam classifier, I will 76 00:02:22,470 --> 00:02:24,500 often look at my 77 00:02:24,730 --> 00:02:26,690 cross validation set and manually 78 00:02:27,360 --> 00:02:29,110 look at the emails that my 79 00:02:29,310 --> 00:02:30,910 algorithm is making errors on. 80 00:02:31,180 --> 00:02:32,250 So, look at the spam emails 81 00:02:32,630 --> 00:02:34,440 and non-spam emails that the 82 00:02:34,640 --> 00:02:36,920 algorithm is misclassifying, and see 83 00:02:37,430 --> 00:02:38,590 if you can spot any systematic 84 00:02:39,210 --> 00:02:41,300 patterns in what type of examples it is misclassifying. 85 00:02:42,980 --> 00:02:44,560 And often by doing that, this 86 00:02:44,810 --> 00:02:45,960 is the process that would inspire 87 00:02:47,170 --> 00:02:48,800 you to design new features.
88 00:02:49,430 --> 00:02:50,420 Or it'll tell you what the 89 00:02:50,920 --> 00:02:52,150 current problems or current 90 00:02:52,400 --> 00:02:53,290 shortcomings of the system are 91 00:02:54,270 --> 00:02:55,550 and give you the inspiration you 92 00:02:55,660 --> 00:02:57,680 need to come up with improvements to it. 93 00:02:58,260 --> 00:03:00,070 Concretely, here's a specific example. 94 00:03:01,350 --> 00:03:02,360 Let's say you've built a spam 95 00:03:02,780 --> 00:03:05,740 classifier and you 96 00:03:05,840 --> 00:03:07,720 have 500 examples in 97 00:03:07,940 --> 00:03:09,650 your cross-validation set. 98 00:03:10,410 --> 00:03:11,760 And let's say in this example, that the 99 00:03:12,010 --> 00:03:13,060 algorithm has a very high error 100 00:03:13,340 --> 00:03:14,640 rate, and it misclassifies a 101 00:03:14,910 --> 00:03:16,500 hundred of these cross-validation examples. 102 00:03:18,770 --> 00:03:19,850 So what I do is manually 103 00:03:20,450 --> 00:03:22,370 examine these 100 errors, and 104 00:03:22,530 --> 00:03:24,450 manually categorize them, based 105 00:03:24,700 --> 00:03:25,810 on things like what type 106 00:03:25,980 --> 00:03:27,110 of email it is and 107 00:03:27,270 --> 00:03:28,630 what cues or what features you 108 00:03:28,710 --> 00:03:31,130 think might have helped the algorithm classify them correctly.
109 00:03:32,450 --> 00:03:33,880 So, specifically, by what 110 00:03:34,080 --> 00:03:35,050 type of email it is, 111 00:03:35,560 --> 00:03:36,870 you know, if I look through these 112 00:03:37,140 --> 00:03:38,180 hundred errors I may find 113 00:03:38,520 --> 00:03:39,660 that maybe the most 114 00:03:39,970 --> 00:03:41,350 common types of spam 115 00:03:41,840 --> 00:03:43,450 emails it misclassifies are maybe 116 00:03:44,010 --> 00:03:45,610 emails on pharmacy, so basically 117 00:03:45,610 --> 00:03:48,300 these are emails trying to 118 00:03:48,610 --> 00:03:50,000 sell drugs, maybe emails that are 119 00:03:50,180 --> 00:03:51,740 trying to sell replicas - 120 00:03:51,760 --> 00:03:54,330 those are those fake watches, fake, you know, random things. 121 00:03:56,160 --> 00:03:59,410 Maybe have some emails trying to steal passwords. 122 00:04:00,240 --> 00:04:01,400 These are also called phishing emails. 123 00:04:02,180 --> 00:04:04,690 But that's another big category of emails and maybe other categories. 124 00:04:06,160 --> 00:04:07,800 So, in terms 125 00:04:08,120 --> 00:04:09,230 of classifying what type of email 126 00:04:09,530 --> 00:04:10,420 it is, I would actually go through 127 00:04:10,890 --> 00:04:11,990 and count up, you know, of 128 00:04:12,200 --> 00:04:14,220 my 100 emails, maybe I 129 00:04:14,400 --> 00:04:15,510 find that twelve of the 130 00:04:15,620 --> 00:04:17,600 mislabeled emails are pharma emails. 131 00:04:18,100 --> 00:04:19,460 And maybe four of them 132 00:04:19,700 --> 00:04:20,840 are emails trying to sell 133 00:04:20,980 --> 00:04:22,680 replicas, they sell fake watches or something.
134 00:04:23,720 --> 00:04:25,060 And maybe I find that 53 135 00:04:25,650 --> 00:04:26,970 of them are these, 136 00:04:27,720 --> 00:04:29,480 what's called phishing emails, basically emails 137 00:04:29,730 --> 00:04:30,900 trying to persuade you to 138 00:04:31,020 --> 00:04:32,760 give them your password, and 31 emails are other types of emails. 139 00:04:35,330 --> 00:04:37,210 And it's by counting up the 140 00:04:37,280 --> 00:04:38,280 number of emails in these 141 00:04:38,430 --> 00:04:39,540 different categories that you might 142 00:04:39,790 --> 00:04:41,570 discover, for example, that the 143 00:04:41,870 --> 00:04:43,100 algorithm is doing really particularly 144 00:04:44,170 --> 00:04:45,640 poorly on emails trying to 145 00:04:45,780 --> 00:04:47,240 steal passwords, and that 146 00:04:47,400 --> 00:04:49,230 may suggest that it might 147 00:04:49,380 --> 00:04:50,490 be worth your effort to look 148 00:04:50,690 --> 00:04:51,650 more carefully at that type 149 00:04:51,900 --> 00:04:53,350 of email, and see if 150 00:04:53,450 --> 00:04:54,450 you can come up with better features 151 00:04:55,070 --> 00:04:56,280 to categorize them correctly. 152 00:04:57,550 --> 00:04:58,930 And also, what I might 153 00:04:59,000 --> 00:05:00,130 do is look at what cues, 154 00:05:00,550 --> 00:05:02,120 or what features, additional features 155 00:05:02,620 --> 00:05:04,920 might have helped the algorithm classify the emails. 
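The category tally from the example (12 pharma, 4 replica, 53 phishing, 31 other, out of 100 errors) could be kept with something as simple as the sketch below. The label strings and the `labels` list are hypothetical stand-ins for notes taken during manual review.

```python
from collections import Counter

# Hypothetical labels assigned by hand to the 100 misclassified emails.
labels = ["pharma"] * 12 + ["replica"] * 4 + ["phishing"] * 53 + ["other"] * 31

counts = Counter(labels)

# List categories from most to least frequent, with their share of all errors.
for category, n in counts.most_common():
    print(f"{category:10s} {n:3d}  ({n / len(labels):.0%})")
```

Sorting by frequency puts the category where the classifier loses the most examples at the top, which is the one the lecture suggests looking at first.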
156 00:05:06,090 --> 00:05:06,970 So let's say that some of 157 00:05:07,060 --> 00:05:09,700 our hypotheses about things or 158 00:05:09,840 --> 00:05:10,780 features that might help us 159 00:05:10,920 --> 00:05:13,240 classify emails better are trying 160 00:05:13,490 --> 00:05:15,600 to detect deliberate misspellings versus 161 00:05:16,220 --> 00:05:18,610 unusual email routing versus unusual, you know, 162 00:05:19,950 --> 00:05:21,450 spamming punctuation, such as 163 00:05:21,790 --> 00:05:23,230 people using a lot of exclamation marks. 164 00:05:23,700 --> 00:05:24,470 And once again, I would manually 165 00:05:24,860 --> 00:05:25,670 go through and let's say 166 00:05:25,760 --> 00:05:27,490 I find five cases of 167 00:05:27,620 --> 00:05:29,400 this, and 16 of 168 00:05:29,500 --> 00:05:30,560 this, and 32 of this and 169 00:05:31,180 --> 00:05:33,620 a bunch of other types of emails as well. 170 00:05:34,770 --> 00:05:36,180 And if this is what 171 00:05:36,350 --> 00:05:37,470 you get on your cross validation 172 00:05:38,070 --> 00:05:39,170 set then it really tells 173 00:05:39,300 --> 00:05:41,060 you that, you know, maybe deliberate misspelling 174 00:05:41,660 --> 00:05:42,730 is a sufficiently rare phenomenon 175 00:05:43,500 --> 00:05:44,480 that maybe it is not really worth 176 00:05:44,840 --> 00:05:47,120 all your time trying to write 177 00:05:47,710 --> 00:05:48,780 algorithms to detect that. 178 00:05:49,480 --> 00:05:50,480 But if you find a lot 179 00:05:50,780 --> 00:05:52,070 of spammers are using, you 180 00:05:52,140 --> 00:05:54,150 know, unusual punctuation then 181 00:05:54,290 --> 00:05:55,250 maybe that's a strong sign 182 00:05:55,670 --> 00:05:56,730 that it might actually be 183 00:05:57,000 --> 00:05:58,510 worth your while to spend 184 00:05:58,780 --> 00:06:00,280 the time to develop more sophisticated 185 00:06:00,910 --> 00:06:02,190 features based on the punctuation.
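Unlike email type, cues are not mutually exclusive: one email can have both odd routing and heavy punctuation. A minimal sketch of tallying them, assuming each reviewed error is recorded as a set of cue tags (the tags, the overlap, and the `cue_counts` helper are all hypothetical; only the totals 5, 16, and 32 come from the example):

```python
from collections import Counter

def cue_counts(reviewed_errors):
    """Count how many reviewed errors exhibit each cue.
    An email may carry several cues, so counts need not sum to the total."""
    counts = Counter()
    for cues in reviewed_errors:
        counts.update(set(cues))  # set() guards against a cue listed twice
    return counts

# Hypothetical review notes: one collection of cue tags per misclassified email.
reviewed = (
    [{"misspelling"}] * 5
    + [{"routing"}] * 11
    + [{"routing", "punctuation"}] * 5
    + [{"punctuation"}] * 27
    + [set()] * 52
)

counts = cue_counts(reviewed)
for cue, n in counts.most_common():
    print(f"{cue:12s} {n:3d} of {len(reviewed)}")
```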
186 00:06:03,330 --> 00:06:04,870 So, this sort of error 187 00:06:05,040 --> 00:06:06,390 analysis which is really 188 00:06:06,690 --> 00:06:08,430 the process of manually examining 189 00:06:09,190 --> 00:06:10,540 the mistakes that the algorithm 190 00:06:10,780 --> 00:06:12,220 makes, can often help 191 00:06:12,560 --> 00:06:14,620 guide you to the most fruitful avenues to pursue. 192 00:06:16,000 --> 00:06:17,410 And this also explains why I 193 00:06:17,590 --> 00:06:19,260 often recommend implementing a quick 194 00:06:19,550 --> 00:06:21,250 and dirty implementation of an algorithm. 195 00:06:22,040 --> 00:06:22,940 What we really want to do 196 00:06:23,260 --> 00:06:24,290 is figure out what are 197 00:06:24,310 --> 00:06:26,770 the most difficult examples for an algorithm to classify. 198 00:06:27,860 --> 00:06:29,920 And very often for different 199 00:06:30,460 --> 00:06:31,730 algorithms, for different learning algorithms, 200 00:06:32,010 --> 00:06:33,500 they'll often find, you 201 00:06:33,560 --> 00:06:35,920 know, similar categories of examples difficult. 202 00:06:37,010 --> 00:06:37,970 And by having a quick and 203 00:06:38,060 --> 00:06:39,840 dirty implementation, that's often a 204 00:06:39,910 --> 00:06:40,850 quick way to let you 205 00:06:41,430 --> 00:06:43,070 identify some errors and quickly 206 00:06:43,620 --> 00:06:44,690 identify what are the 207 00:06:44,790 --> 00:06:47,760 hard examples so that you can focus your efforts on those. 208 00:06:49,230 --> 00:06:51,220 Lastly, when developing learning algorithms, 209 00:06:52,260 --> 00:06:53,880 one other useful tip is 210 00:06:54,190 --> 00:06:55,230 to make sure that you have 211 00:06:55,590 --> 00:06:56,450 a way, that you have a 212 00:06:56,810 --> 00:06:59,710 numerical evaluation of your learning algorithm. 
213 00:07:02,130 --> 00:07:03,220 Now what I mean by that is that 214 00:07:03,460 --> 00:07:04,670 if you're developing a learning algorithm, 215 00:07:05,230 --> 00:07:07,180 it is often incredibly helpful 216 00:07:08,060 --> 00:07:09,170 if you have a way of 217 00:07:09,460 --> 00:07:10,830 evaluating your learning algorithm 218 00:07:11,290 --> 00:07:13,100 that just gives you back a single real number. 219 00:07:13,650 --> 00:07:14,880 Maybe accuracy, maybe error. 220 00:07:15,620 --> 00:07:18,390 But a single real number that tells you how well your learning algorithm is doing. 221 00:07:20,280 --> 00:07:21,330 I'll talk more about these specific 222 00:07:21,770 --> 00:07:24,650 concepts in later videos, but here's a specific example. 223 00:07:25,790 --> 00:07:26,600 Let's say we are trying to 224 00:07:26,690 --> 00:07:27,990 decide whether or not we 225 00:07:28,060 --> 00:07:29,140 should treat words like discount, 226 00:07:29,590 --> 00:07:32,060 discounts, discounter, discounting, as the same word. 227 00:07:32,370 --> 00:07:33,390 So maybe one way to 228 00:07:33,520 --> 00:07:34,770 do that is to just 229 00:07:35,400 --> 00:07:38,780 look at the first few characters in a word. 230 00:07:38,960 --> 00:07:40,240 Like, you know, if you just look at 231 00:07:40,300 --> 00:07:41,690 the first few characters of 232 00:07:41,780 --> 00:07:44,640 a word, then you figure 233 00:07:44,920 --> 00:07:45,970 out that maybe all of these 234 00:07:46,130 --> 00:07:47,990 words are roughly - have similar meanings. 235 00:07:50,460 --> 00:07:52,090 In natural language processing, the 236 00:07:52,250 --> 00:07:53,270 way that this is done is 237 00:07:53,510 --> 00:07:55,960 actually using a type of software called stemming software.
238 00:07:56,940 --> 00:07:58,080 If you ever want to do 239 00:07:58,160 --> 00:07:59,880 this yourself, search on a 240 00:07:59,950 --> 00:08:01,240 web search engine for the 241 00:08:01,500 --> 00:08:02,660 Porter Stemmer and that 242 00:08:02,960 --> 00:08:04,320 would be, you know, one reasonable piece of 243 00:08:04,620 --> 00:08:05,830 software for doing this sort 244 00:08:06,110 --> 00:08:07,020 of stemming, which will let 245 00:08:07,130 --> 00:08:08,140 you treat all of these discount, 246 00:08:08,800 --> 00:08:10,540 discounts, and so on as the same word. 247 00:08:13,950 --> 00:08:15,930 But using stemming software 248 00:08:16,630 --> 00:08:17,710 that basically looks at the 249 00:08:17,830 --> 00:08:19,290 first few letters of the 250 00:08:19,450 --> 00:08:21,630 word more or less, it can help but it can hurt. 251 00:08:22,240 --> 00:08:23,490 And it can hurt because, for 252 00:08:23,900 --> 00:08:25,360 example, this software may 253 00:08:25,930 --> 00:08:27,850 mistake the words universe and 254 00:08:27,990 --> 00:08:29,980 university as being the 255 00:08:30,070 --> 00:08:31,220 same thing because, you know, 256 00:08:31,450 --> 00:08:33,220 these two words start off 257 00:08:33,480 --> 00:08:35,480 with very similar characters, with the same letters. 258 00:08:37,300 --> 00:08:39,050 So if you're trying 259 00:08:39,280 --> 00:08:40,290 to decide whether or not 260 00:08:40,630 --> 00:08:42,490 to use stemming software for 261 00:08:42,670 --> 00:08:45,960 a spam classifier, it is not always easy to tell. 262 00:08:46,350 --> 00:08:47,810 And in particular, error analysis 263 00:08:48,510 --> 00:08:49,590 may not actually be helpful 264 00:08:51,030 --> 00:08:52,860 for deciding if this 265 00:08:53,060 --> 00:08:54,410 sort of stemming idea is a good idea.
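To make the "first few characters" idea concrete, here is a deliberately crude sketch. It is not the Porter Stemmer, which applies suffix-stripping rules rather than fixed-length truncation; `prefix_stem` and the cutoff of 7 characters are assumptions chosen only to show both the benefit and the collision mentioned above.

```python
def prefix_stem(word, n=7):
    """Crude stand-in for a stemmer: keep only the first n characters."""
    return word[:n]

discount_words = ["discount", "discounts", "discounter", "discounting"]
discount_stems = {prefix_stem(w) for w in discount_words}

# Helpful case: the discount variants all map to a single stem.
print(discount_stems)

# Harmful case: two unrelated words end up with the same stem.
print(prefix_stem("universe"), prefix_stem("university"))
```

Whether the merged features help more than the collisions hurt is not something this snippet can answer; that is the question the cross validation error is used for.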
266 00:08:55,570 --> 00:08:56,740 Instead, the best way 267 00:08:57,020 --> 00:08:58,320 to figure out if using stemming 268 00:08:58,690 --> 00:08:59,970 software will help 269 00:09:00,190 --> 00:09:01,570 your classifier is if you 270 00:09:01,740 --> 00:09:02,980 have a way to very quickly 271 00:09:03,370 --> 00:09:05,170 just try it and see if it works. 272 00:09:08,560 --> 00:09:09,530 And in order to do this, 273 00:09:10,260 --> 00:09:11,350 having a way to numerically 274 00:09:12,250 --> 00:09:14,570 evaluate your algorithm is going to be very helpful. 275 00:09:15,940 --> 00:09:17,670 Concretely, maybe the most 276 00:09:18,110 --> 00:09:19,190 natural thing to do is 277 00:09:19,350 --> 00:09:20,250 to look at the cross validation 278 00:09:20,900 --> 00:09:23,510 error of the algorithm with and without stemming. 279 00:09:24,590 --> 00:09:25,560 So, if you run your 280 00:09:25,800 --> 00:09:27,190 algorithm without stemming and you 281 00:09:27,330 --> 00:09:28,430 end up with, let's say, 282 00:09:29,080 --> 00:09:31,260 five percent classification error, and 283 00:09:31,360 --> 00:09:32,410 you re-run it with stemming and you 284 00:09:32,540 --> 00:09:33,780 end up with, let's say, three 285 00:09:34,110 --> 00:09:36,170 percent classification error, then this 286 00:09:36,440 --> 00:09:37,920 decrease in error very quickly 287 00:09:38,640 --> 00:09:39,980 allows you to decide that, 288 00:09:40,310 --> 00:09:42,250 you know, it looks like using stemming is a good idea. 289 00:09:43,080 --> 00:09:44,650 For this particular problem, there's 290 00:09:44,940 --> 00:09:46,560 a very natural single real 291 00:09:46,830 --> 00:09:50,210 number evaluation metric, namely, the cross validation error. 292 00:09:50,930 --> 00:09:52,700 We'll see later examples where coming 293 00:09:53,080 --> 00:09:54,360 up with this sort of single 294 00:09:54,790 --> 00:09:58,220 real number evaluation metric may need a little bit more work.
295 00:09:58,790 --> 00:09:59,840 But as we'll see in 296 00:09:59,930 --> 00:10:01,620 a later video, doing so would 297 00:10:01,750 --> 00:10:02,860 also then let you 298 00:10:02,990 --> 00:10:04,290 make these decisions much more quickly 299 00:10:04,760 --> 00:10:06,380 of, say, whether or not to use stemming. 300 00:10:08,700 --> 00:10:09,950 And just one more quick example. 301 00:10:10,680 --> 00:10:11,670 Let's say that you're also trying 302 00:10:12,040 --> 00:10:13,450 to decide whether or not 303 00:10:13,650 --> 00:10:15,710 to distinguish between upper versus lower case. 304 00:10:15,990 --> 00:10:16,910 So, you know, is the word 305 00:10:17,060 --> 00:10:18,850 Mom with uppercase M 306 00:10:19,060 --> 00:10:20,390 versus lower case m, 307 00:10:20,700 --> 00:10:21,720 I mean, should that be treated as 308 00:10:21,780 --> 00:10:23,810 the same word or as different words? 309 00:10:23,970 --> 00:10:26,890 Should these be treated as the same feature or as different features? 310 00:10:27,010 --> 00:10:28,060 And so once again, 311 00:10:28,350 --> 00:10:29,150 because we have a way 312 00:10:29,300 --> 00:10:30,790 to evaluate our algorithm, if 313 00:10:31,060 --> 00:10:32,350 you try this out here, if 314 00:10:32,650 --> 00:10:34,910 I stop distinguishing upper 315 00:10:35,140 --> 00:10:36,490 and lower case, maybe I 316 00:10:36,600 --> 00:10:38,580 end up with 3.2% 317 00:10:38,700 --> 00:10:39,820 error and I find that 318 00:10:40,020 --> 00:10:41,750 therefore this does worse 319 00:10:42,260 --> 00:10:43,360 than, you know, if I use only 320 00:10:43,640 --> 00:10:45,110 stemming, and so this lets 321 00:10:45,370 --> 00:10:47,420 me very quickly decide to go 322 00:10:48,270 --> 00:10:49,720 ahead and to distinguish or to 323 00:10:49,820 --> 00:10:51,540 not distinguish between upper and lower case.
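The comparison described here comes down to computing one number per variant and picking the smallest. A minimal sketch, assuming the three illustrative error rates from the example (5%, 3%, 3.2%); the variant names and the `classification_error` helper are hypothetical:

```python
def classification_error(predictions, labels):
    """Fraction of examples where the prediction differs from the label."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Cross validation error per variant, using the figures from the example.
# In practice each value would come from classification_error(...) on the
# cross validation set after retraining with that variant switched on.
cv_error = {
    "no stemming": 0.050,
    "stemming": 0.030,
    "stemming, case ignored": 0.032,
}

best_variant = min(cv_error, key=cv_error.get)
print(best_variant, cv_error[best_variant])
```

With a difference as small as 3.0% versus 3.2%, the size of the cross validation set matters: on 500 examples that gap is a single email, so a result this close may be worth rechecking rather than treating as decisive.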
324 00:10:52,140 --> 00:10:53,390 So when you're developing 325 00:10:53,690 --> 00:10:55,260 a learning algorithm, very often 326 00:10:55,650 --> 00:10:56,840 you'll be trying out lots of 327 00:10:57,050 --> 00:10:59,930 new ideas and lots of new versions of your learning algorithm. 328 00:11:00,960 --> 00:11:02,050 If every time you try 329 00:11:02,350 --> 00:11:03,740 out a new idea you 330 00:11:03,840 --> 00:11:05,610 end up manually examining a 331 00:11:05,750 --> 00:11:06,730 bunch of examples again to 332 00:11:06,860 --> 00:11:08,530 see if it got better or worse, you 333 00:11:08,640 --> 00:11:09,410 know, that's going to make it 334 00:11:09,580 --> 00:11:10,610 really hard to make decisions 335 00:11:10,980 --> 00:11:12,410 on do you use stemming or not. 336 00:11:12,580 --> 00:11:13,640 Do you distinguish upper or lowercase or not? 337 00:11:15,180 --> 00:11:16,590 But by having a single real 338 00:11:16,770 --> 00:11:18,520 number evaluation metric, you can 339 00:11:18,680 --> 00:11:21,150 then just look and see oh, did the error go up or go down? 340 00:11:22,420 --> 00:11:23,620 And you can use that to much 341 00:11:23,940 --> 00:11:25,760 more rapidly try out 342 00:11:25,840 --> 00:11:27,820 new ideas and almost right 343 00:11:27,990 --> 00:11:29,550 away tell if your new 344 00:11:29,690 --> 00:11:31,480 idea has improved or worsened 345 00:11:32,440 --> 00:11:33,230 the performance of the learning algorithm 346 00:11:33,930 --> 00:11:35,440 and this will let 347 00:11:35,560 --> 00:11:38,340 you often make much faster progress. 348 00:11:38,530 --> 00:11:39,720 So the recommended, strongly recommended 349 00:11:40,220 --> 00:11:41,790 way to do error analysis is 350 00:11:42,370 --> 00:11:44,760 on the cross validation set rather than the test set.
351 00:11:45,490 --> 00:11:46,970 But, you know, there are 352 00:11:47,240 --> 00:11:48,260 people that will do this on 353 00:11:48,370 --> 00:11:49,480 the test set even though that's 354 00:11:49,730 --> 00:11:51,530 definitely a less mathematically appropriate, 355 00:11:52,190 --> 00:11:54,560 certainly a less recommended 356 00:11:54,730 --> 00:11:55,660 thing to do than to 357 00:11:55,780 --> 00:11:57,240 do error analysis on your 358 00:11:57,450 --> 00:11:58,760 cross validation set. 359 00:11:59,140 --> 00:12:01,160 So, to wrap up this video, when starting 360 00:12:01,830 --> 00:12:03,340 on a new machine learning problem, what 361 00:12:03,610 --> 00:12:05,370 I almost always recommend is 362 00:12:05,610 --> 00:12:06,930 to implement a quick and 363 00:12:07,030 --> 00:12:08,710 dirty implementation of your learning algorithm. 364 00:12:09,780 --> 00:12:11,760 And I've almost never seen 365 00:12:12,120 --> 00:12:15,370 anyone spend too little time on this quick and dirty implementation. 366 00:12:18,640 --> 00:12:20,210 I pretty much only ever see 367 00:12:20,480 --> 00:12:22,050 people spend much too much 368 00:12:22,370 --> 00:12:23,720 time building their first, you know, 369 00:12:24,580 --> 00:12:25,800 supposedly quick and dirty implementations. 370 00:12:26,590 --> 00:12:28,100 So really, don't worry about 371 00:12:29,070 --> 00:12:31,210 it being too quick, or don't worry about it being too dirty.
372 00:12:32,120 --> 00:12:33,580 But really implement something as 373 00:12:33,690 --> 00:12:35,220 quickly as you can, and once 374 00:12:35,450 --> 00:12:37,550 you have the initial implementation this 375 00:12:37,820 --> 00:12:38,860 is then a powerful tool for 376 00:12:39,230 --> 00:12:40,420 deciding where to spend your 377 00:12:40,610 --> 00:12:42,170 time next, because first you 378 00:12:42,390 --> 00:12:43,390 can look at the errors it makes, 379 00:12:43,630 --> 00:12:44,720 and do this sort of error analysis 380 00:12:45,280 --> 00:12:46,360 to see what mistakes it makes 381 00:12:47,010 --> 00:12:48,420 and use that to inspire further development. 382 00:12:49,030 --> 00:12:50,880 And second, assuming your 383 00:12:51,000 --> 00:12:53,360 quick and dirty implementation incorporated a 384 00:12:53,620 --> 00:12:55,700 single real number evaluation metric, this 385 00:12:55,940 --> 00:12:57,660 can then be a vehicle for 386 00:12:57,730 --> 00:12:58,980 you to try out different ideas 387 00:12:59,810 --> 00:13:00,810 and quickly see if the 388 00:13:01,030 --> 00:13:02,170 different ideas you're trying out 389 00:13:02,440 --> 00:13:03,830 are improving the performance of 390 00:13:03,920 --> 00:13:05,420 your algorithm and therefore let 391 00:13:05,570 --> 00:13:06,470 you maybe much more quickly 392 00:13:06,860 --> 00:13:08,440 make decisions about what things 393 00:13:08,760 --> 00:13:09,900 to fold in, and what things to 394 00:13:10,240 --> 00:13:11,520 incorporate into your learning algorithm.