In the last video, we talked about the process of evaluating an anomaly detection algorithm, and there we started to use some labeled data, with examples that we knew were either anomalous or not anomalous, with y equals 1 or y equals 0. So the question then arises: if we have this labeled data, with some examples known to be anomalies and some known not to be anomalies, why don't we just use a supervised learning algorithm? Why don't we just use logistic regression or a neural network to learn directly from our labeled data to predict whether y equals 1 or y equals 0? In this video, I'll try to share with you some of the thinking and some guidelines for when you should probably use an anomaly detection algorithm, and when it might be more fruitful to consider using a supervised learning algorithm.

This slide shows the settings under which you should maybe use anomaly detection, versus when supervised learning might be more fruitful. If you have a problem with a very small number of positive examples (and remember, examples with y equals 1 are the anomalous examples), then you might consider using an anomaly detection algorithm instead. Having 0 to 20, maybe up to 50, positive examples might be pretty typical, and usually when we have such a small set of positive examples, we are going to save the positive examples just for the cross validation and test sets. In contrast, in a typical anomaly detection setting, we will often have a relatively large number of negative examples, that is, of these normal examples of normal aircraft engines. And we can then use this very large number of negative examples with which to fit the model p of x.
And so there is this idea that in many anomaly detection applications you have very few positive examples and lots of negative examples, and when we are doing the process of estimating p of x, of fitting all those Gaussian parameters, we need only negative examples to do that. So if you have a lot of negative data, we can still fit p of x pretty well. In contrast, for supervised learning, more typically we would have a reasonably large number of both positive and negative examples. And so this is one way to look at your problem and decide if you should use an anomaly detection algorithm or a supervised learning algorithm.

Here is another way people often think about anomaly detection algorithms. For anomaly detection applications, often there are many different types of anomalies. Think about aircraft engines: there are so many different ways for an aircraft engine to go wrong, so many things that could break it. And so, if that's the case and you have a pretty small set of positive examples, then it can be difficult for an algorithm to learn from your small set of positive examples what the anomalies look like. In particular, future anomalies may look nothing like the ones you've seen so far. Maybe in your set of positive examples you have seen 5 or 10 or 20 different ways that an aircraft engine could go wrong, but maybe tomorrow you need to detect a totally new type of anomaly, a totally new way for an aircraft engine to be broken, that you have just never seen before. If that is the case, then it might be more promising to just model the negative examples with a Gaussian model p of x, rather than trying too hard to model the positive examples, because tomorrow's anomaly may be nothing like the ones you've seen so far.
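To make that concrete, here is a minimal sketch in Python of what fitting p of x to only the negative examples might look like. It assumes the per-feature Gaussian model from the earlier videos; the function names, the stand-in data, and the threshold value are just illustrative, not anything fixed by the lecture.

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate per-feature Gaussian parameters from normal
    (y = 0) examples only; no positive examples are needed."""
    mu = X_train.mean(axis=0)        # mean of each feature
    sigma2 = X_train.var(axis=0)     # variance of each feature
    return mu, sigma2

def p_of_x(X, mu, sigma2):
    """p(x) = product over features j of the Gaussian density
    N(x_j; mu_j, sigma2_j)."""
    densities = (1.0 / np.sqrt(2.0 * np.pi * sigma2)) * \
                np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return densities.prod(axis=1)

# Hypothetical data: many normal engines, two features each.
X_normal = np.random.randn(1000, 2)
mu, sigma2 = fit_gaussian(X_normal)

epsilon = 0.02                        # illustrative threshold
x_new = np.array([[4.0, -3.5]])       # one new engine to check
is_anomaly = p_of_x(x_new, mu, sigma2) < epsilon
```

Notice that the labels never enter the fitting step at all, which is exactly why a large pile of negative examples is enough.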
In contrast, in some other problems you have enough positive examples for an algorithm to get a sense of what the positive examples are like. And in particular, if you think that future positive examples are likely to be similar to the ones in the training set, then in that setting it might be more reasonable to have a supervised learning algorithm that looks at a lot of the positive examples and a lot of the negative examples, and uses that to try to distinguish between positives and negatives.

So hopefully this gives you a sense of how, if you have a specific problem, to think about whether to use an anomaly detection algorithm or a supervised learning algorithm. And the key difference really is that in anomaly detection we often have such a small number of positive examples that it is not possible for a learning algorithm to learn that much from them. So what we do instead is take a large set of negative examples and have the algorithm learn p of x from just the negative examples, the normal aircraft engines, say. And we reserve the small number of positive examples for evaluating our algorithm, to use in either the cross validation set or the test set.
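Here is a sketch of what that reservation of positive examples might look like in practice. The lecture does not pin down a specific procedure here, so as an illustration this picks the threshold epsilon by maximizing the F1 score on the cross validation set, one common choice when positives are rare; select_epsilon and the commented-out usage are hypothetical, continuing the sketch above.

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Sweep candidate thresholds over the cross validation
    densities and keep the one with the best F1 score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                 # flag low-density points
        tp = np.sum(pred & (y_cv == 1))
        fp = np.sum(pred & (y_cv == 0))
        fn = np.sum(~pred & (y_cv == 1))
        if tp == 0:
            continue                      # avoid division by zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# Hypothetical usage: X_cv mixes many normal examples with the
# few known anomalies, and y_cv holds their labels.
# p_cv = p_of_x(X_cv, mu, sigma2)
# epsilon = select_epsilon(p_cv, y_cv)
```

F1 is a sensible criterion here because with so few positives, a classifier that predicts y equals 0 everywhere would still score very well on plain accuracy.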
And just as a side comment about these many different types of anomalies: in some earlier videos we talked about the email spam example. There are actually many different types of spam email: spam trying to sell you things, spam trying to steal your passwords (these are called phishing emails), and many other types. But for the spam problem, we usually have enough examples of spam email to see most of these different types, because we have a large set of examples of spam, and that's why we usually think of spam as a supervised learning setting, even though there may be many different types of spam.

And so, if we look at some applications of anomaly detection versus supervised learning, we'll find that in fraud detection, if there are many different ways for people to try to commit fraud, and you have a relatively small training set, a small number of fraudulent users on your website, then I would use an anomaly detection algorithm. I should say, if you are a major online retailer and you have actually had a lot of people try to commit fraud on your website, so that you actually have a lot of examples where y equals 1, then sometimes fraud detection could actually shift over to the supervised learning column. But if you haven't seen that many examples of users doing strange things on your website, then more frequently fraud detection is treated as an anomaly detection problem rather than as a supervised learning problem.

Other examples: we talked about manufacturing already, where hopefully you'll see many more normal examples than anomalies. But then again, for some manufacturing processes, if you're manufacturing in very large volumes and you've seen a lot of bad examples, maybe manufacturing could shift to the supervised learning column as well. But if you haven't seen that many examples of bad products, then I would use anomaly detection.
For monitoring machines in a data center, again, similar sorts of arguments apply. Whereas for email spam classification, weather prediction, and classifying cancers, where you have reasonably balanced classes, that is, many examples of both your positive and your negative classes, we would tend to treat all of these as supervised learning problems.

So hopefully that gives you a sense of the properties of a learning problem that would cause you to treat it as an anomaly detection problem versus a supervised learning problem. And for many of the problems faced by various technology companies and so on, we actually are in a setting where we have very few, or sometimes zero, positive training examples, and maybe there are so many different types of anomalies that we've never seen them before. For those sorts of problems, very often the algorithm that is used is an anomaly detection algorithm.