In the last video, we talked about the process of evaluating an anomaly detection algorithm, and there we started to use some labeled data, with examples that we knew were either anomalous or not anomalous, with y equals 1 or y equals 0. So the question then arises: if we have this labeled data, with some examples known to be anomalies and some known not to be anomalies, why don't we just use a supervised learning algorithm? Why don't we just use logistic regression or a neural network to learn directly from our labeled data to predict whether y equals 1 or y equals 0? In this video, I'll try to share with you some of the thinking and some guidelines for when you should probably use an anomaly detection algorithm, and when it might be more fruitful to consider using a supervised learning algorithm.

This slide shows the settings under which you should maybe use anomaly detection, versus when supervised learning might be more fruitful. If you have a problem with a very small number of positive examples (and remember, examples with y equals 1 are the anomalous examples), then you might consider using an anomaly detection algorithm instead. Having 0 to 20, maybe up to 50, positive examples might be pretty typical, and usually when we have such a small set of positive examples, we are going to save the positive examples just for the cross validation and test sets. In contrast, in a typical anomaly detection setting, we will often have a relatively large number of negative examples, that is, of these normal examples of normal aircraft engines. And we can then use this very large number of negative examples with which to fit the model p of x.
And so there is this idea that in many anomaly detection applications you have very few positive examples and lots of negative examples, and when we are doing the process of estimating p of x, of fitting all those Gaussian parameters, we need only negative examples to do that. So if you have a lot of negative data, we can still fit p of x pretty well. In contrast, for supervised learning, more typically we would have a reasonably large number of both positive and negative examples. And so this is one way to look at your problem and decide if you should use an anomaly detection algorithm or a supervised learning algorithm.

Here is another way people often think about anomaly detection algorithms. For anomaly detection applications, often there are many different types of anomalies. Think about aircraft engines: there are so many different ways for an aircraft engine to go wrong, so many things that could break it. And so, if that's the case and you have a pretty small set of positive examples, then it can be difficult for an algorithm to learn from your small set of positive examples what the anomalies look like. In particular, future anomalies may look nothing like the ones you've seen so far. Maybe in your set of positive examples you have seen 5 or 10 or 20 different ways that an aircraft engine could go wrong, but maybe tomorrow you need to detect a totally new type of anomaly, a totally new way for an aircraft engine to be broken, that you have just never seen before. If that is the case, then it might be more promising to just model the negative examples with a Gaussian model p of x, rather than trying too hard to model the positive examples, because tomorrow's anomaly may be nothing like the ones you've seen so far.
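To make that concrete, here is a minimal sketch in Python of what fitting p of x to only the negative examples might look like. It assumes the per-feature Gaussian model from the earlier videos; the function names, the stand-in data, and the threshold value are just illustrative, not anything fixed by the lecture.

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate per-feature Gaussian parameters from normal
    (y = 0) examples only; no positive examples are needed."""
    mu = X_train.mean(axis=0)        # mean of each feature
    sigma2 = X_train.var(axis=0)     # variance of each feature
    return mu, sigma2

def p_of_x(X, mu, sigma2):
    """p(x) = product over features j of the Gaussian density
    N(x_j; mu_j, sigma2_j)."""
    densities = (1.0 / np.sqrt(2.0 * np.pi * sigma2)) * \
                np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return densities.prod(axis=1)

# Hypothetical data: many normal engines, two features each.
X_normal = np.random.randn(1000, 2)
mu, sigma2 = fit_gaussian(X_normal)

epsilon = 0.02                        # illustrative threshold
x_new = np.array([[4.0, -3.5]])       # one new engine to check
is_anomaly = p_of_x(x_new, mu, sigma2) < epsilon
```

Notice that the labels never enter the fitting step at all, which is exactly why a large pile of negative examples is enough.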
In contrast, in some other problems you have enough positive examples for an algorithm to get a sense of what the positive examples are like. And in particular, if you think that future positive examples are likely to be similar to the ones in the training set, then in that setting it might be more reasonable to have a supervised learning algorithm that looks at a lot of the positive examples and a lot of the negative examples, and uses that to try to distinguish between positives and negatives.

So hopefully this gives you a sense of how, if you have a specific problem, to think about whether to use an anomaly detection algorithm or a supervised learning algorithm. And the key difference really is that in anomaly detection we often have such a small number of positive examples that it is not possible for a learning algorithm to learn that much from them. So what we do instead is take a large set of negative examples and have the algorithm learn p of x from just the negative examples, the normal aircraft engines, say. And we reserve the small number of positive examples for evaluating our algorithm, to use in either the cross validation set or the test set.
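Here is a sketch of what that reservation of positive examples might look like in practice. The lecture does not pin down a specific procedure here, so as an illustration this picks the threshold epsilon by maximizing the F1 score on the cross validation set, one common choice when positives are rare; select_epsilon and the commented-out usage are hypothetical, continuing the sketch above.

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Sweep candidate thresholds over the cross validation
    densities and keep the one with the best F1 score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                 # flag low-density points
        tp = np.sum(pred & (y_cv == 1))
        fp = np.sum(pred & (y_cv == 0))
        fn = np.sum(~pred & (y_cv == 1))
        if tp == 0:
            continue                      # avoid division by zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# Hypothetical usage: X_cv mixes many normal examples with the
# few known anomalies, and y_cv holds their labels.
# p_cv = p_of_x(X_cv, mu, sigma2)
# epsilon = select_epsilon(p_cv, y_cv)
```

F1 is a sensible criterion here because with so few positives, a classifier that predicts y equals 0 everywhere would still score very well on plain accuracy.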
And just as a side comment about these many different types of anomalies: in some earlier videos we talked about the email spam example. There are actually many different types of spam email: spam trying to sell you things, spam trying to steal your passwords (these are called phishing emails), and many other types. But for the spam problem, we usually have enough examples of spam email to see most of these different types, because we have a large set of examples of spam, and that's why we usually think of spam as a supervised learning setting, even though there may be many different types of spam.

And so, if we look at some applications of anomaly detection versus supervised learning, we'll find that in fraud detection, if there are many different ways for people to try to commit fraud, and you have a relatively small training set, a small number of fraudulent users on your website, then I would use an anomaly detection algorithm. I should say, if you are a major online retailer and you have actually had a lot of people try to commit fraud on your website, so that you actually have a lot of examples where y equals 1, then sometimes fraud detection could actually shift over to the supervised learning column. But if you haven't seen that many examples of users doing strange things on your website, then more frequently fraud detection is treated as an anomaly detection problem rather than as a supervised learning problem.

Other examples: we talked about manufacturing already, where hopefully you'll see many more normal examples than anomalies. But then again, for some manufacturing processes, if you're manufacturing in very large volumes and you've seen a lot of bad examples, maybe manufacturing could shift to the supervised learning column as well. But if you haven't seen that many examples of bad products, then I would use anomaly detection.
For monitoring machines in a data center, again, similar sorts of arguments apply. Whereas for email spam classification, weather prediction, and classifying cancers, where you have reasonably balanced classes, that is, many examples of both your positive and your negative classes, we would tend to treat all of these as supervised learning problems.

So hopefully that gives you a sense of the properties of a learning problem that would cause you to treat it as an anomaly detection problem versus a supervised learning problem. And for many of the problems faced by various technology companies and so on, we actually are in a setting where we have very few, or sometimes zero, positive training examples, and maybe there are so many different types of anomalies that we've never seen them before. For those sorts of problems, very often the algorithm that is used is an anomaly detection algorithm.