1
00:00:00,170 --> 00:00:01,190
In this next set of videos,

2
00:00:01,720 --> 00:00:02,680
I'd like to tell you about

3
00:00:03,050 --> 00:00:04,560
a problem called Anomaly Detection.

4
00:00:05,710 --> 00:00:07,220
This is a reasonably commonly

5
00:00:07,870 --> 00:00:08,740
used type of machine learning.

6
00:00:09,580 --> 00:00:10,990
And one of the interesting aspects

7
00:00:11,580 --> 00:00:13,250
is that it's mainly an

8
00:00:14,020 --> 00:00:15,860
unsupervised problem, but there are some

9
00:00:16,320 --> 00:00:17,240
aspects of it that are

10
00:00:17,510 --> 00:00:20,000
also very similar to sort of the supervised learning problem.

11
00:00:21,160 --> 00:00:22,440
So, what is anomaly detection?

12
00:00:23,380 --> 00:00:25,000
To explain it, let me use

13
00:00:25,240 --> 00:00:27,780
the motivating example of: Imagine

14
00:00:28,440 --> 00:00:30,040
that you're a manufacturer of

15
00:00:30,330 --> 00:00:32,370
aircraft engines, and let's

16
00:00:32,600 --> 00:00:33,850
say that as your aircraft

17
00:00:34,280 --> 00:00:35,330
engines roll off the assembly

18
00:00:35,620 --> 00:00:37,580
line, you're doing, you know, QA or

19
00:00:37,820 --> 00:00:39,850
quality assurance testing, and as

20
00:00:40,030 --> 00:00:41,340
part of that testing you

21
00:00:41,410 --> 00:00:43,140
measure features of your

22
00:00:43,510 --> 00:00:44,900
aircraft engine, like maybe, you measure

23
00:00:45,180 --> 00:00:46,820
the heat generated, things like

24
00:00:46,860 --> 00:00:48,340
the vibrations and so on.

25
00:00:48,630 --> 00:00:49,570
I have some friends that worked

26
00:00:49,860 --> 00:00:50,940
on this problem a long time

27
00:00:51,010 --> 00:00:52,610
ago, and these were actually the

28
00:00:52,710 --> 00:00:53,960
sorts of features that they were

29
00:00:54,470 --> 00:00:55,910
collecting off actual aircraft

30
00:00:56,280 --> 00:00:58,540
engines so you

31
00:00:58,630 --> 00:00:59,570
now have a data set of

32
00:00:59,700 --> 00:01:01,000
x1 through xm, if you have

33
00:01:01,760 --> 00:01:04,490
manufactured m aircraft engines,

34
00:01:05,030 --> 00:01:06,740
and if you plot your data, maybe it looks like this.

35
00:01:07,130 --> 00:01:08,640
So, each point here, each cross

36
00:01:08,770 --> 00:01:10,580
here is one of your unlabeled examples.

37
00:01:11,990 --> 00:01:15,220
So, the anomaly detection problem is the following.

38
00:01:16,450 --> 00:01:17,770
Let's say that on, you

39
00:01:17,880 --> 00:01:18,970
know, the next day, you

40
00:01:19,140 --> 00:01:20,390
have a new aircraft engine

41
00:01:20,810 --> 00:01:21,860
that rolls off the assembly line

42
00:01:22,320 --> 00:01:23,890
and your new aircraft engine has

43
00:01:24,160 --> 00:01:25,440
some set of features x-test.

44
00:01:26,290 --> 00:01:27,680
What the anomaly detection problem is,

45
00:01:27,930 --> 00:01:29,070
we want to know if this

46
00:01:29,420 --> 00:01:31,310
aircraft engine is anomalous in

47
00:01:31,520 --> 00:01:32,480
any way, in other words, we want

48
00:01:32,740 --> 00:01:34,110
to know if, maybe, this engine

49
00:01:34,570 --> 00:01:36,290
should undergo further testing

50
00:01:37,330 --> 00:01:38,370
or if it looks

51
00:01:38,710 --> 00:01:40,560
like an okay engine, and

52
00:01:40,740 --> 00:01:41,700
so it's okay to just ship

53
00:01:41,880 --> 00:01:43,260
it to a customer without further testing.

54
00:01:44,560 --> 00:01:45,670
So, if your new

55
00:01:45,840 --> 00:01:47,330
aircraft engine looks like

56
00:01:47,540 --> 00:01:49,150
a point over there, well, you

57
00:01:49,260 --> 00:01:50,200
know, that looks a lot

58
00:01:50,360 --> 00:01:51,440
like the aircraft engines we've seen

59
00:01:51,650 --> 00:01:53,860
before, and so maybe we'll say that it looks okay.

60
00:01:54,750 --> 00:01:55,740
Whereas, if your new aircraft

61
00:01:56,200 --> 00:01:59,390
engine, if x-test, you know, were

62
00:01:59,620 --> 00:02:00,430
a point that were out here,

63
00:02:00,910 --> 00:02:02,270
so that if x1 and

64
00:02:02,410 --> 00:02:04,800
X2 are the features of this new example.

65
00:02:05,360 --> 00:02:06,530
If x-test were all the

66
00:02:06,590 --> 00:02:08,930
way out there, then we would call that an anomaly,

67
00:02:10,420 --> 00:02:11,640
and maybe send that aircraft engine

68
00:02:12,070 --> 00:02:13,720
for further testing before we

69
00:02:13,870 --> 00:02:15,130
ship it to a customer, since

70
00:02:16,010 --> 00:02:18,340
it looks very different than

71
00:02:18,600 --> 00:02:20,350
the rest of the aircraft engines we've seen before.

72
00:02:21,000 --> 00:02:22,560
More formally in the anomaly

73
00:02:22,960 --> 00:02:24,230
detection problem, we're given

74
00:02:24,900 --> 00:02:26,160
some dataset, x1 through

75
00:02:26,280 --> 00:02:28,340
xm of examples, and we

76
00:02:28,460 --> 00:02:29,720
usually assume that these m

77
00:02:29,880 --> 00:02:32,250
examples are normal or

78
00:02:33,120 --> 00:02:34,910
non-anomalous examples, and we

79
00:02:34,980 --> 00:02:36,100
want an algorithm to tell us

80
00:02:36,290 --> 00:02:38,300
if some new example x-test is anomalous.

81
00:02:38,850 --> 00:02:40,080
The approach that we're going

82
00:02:40,130 --> 00:02:41,670
to take is that given this training

83
00:02:42,060 --> 00:02:43,300
set, given the unlabeled training

84
00:02:43,690 --> 00:02:45,280
set, we're going to

85
00:02:45,420 --> 00:02:46,920
build a model for p of

86
00:02:47,020 --> 00:02:48,060
x. In other words, we're

87
00:02:48,140 --> 00:02:49,320
going to build a model for the

88
00:02:49,520 --> 00:02:51,230
probability of x, where

89
00:02:51,390 --> 00:02:53,330
x are these features of, say, aircraft engines.

90
00:02:54,620 --> 00:02:56,290
And so, having built a

91
00:02:56,530 --> 00:02:57,350
model of the probability of x

92
00:02:58,070 --> 00:02:59,230
we're then going to say that

93
00:02:59,820 --> 00:03:01,280
for the new aircraft engine, if

94
00:03:01,520 --> 00:03:04,670
p of x-test is less

95
00:03:04,920 --> 00:03:07,180
than some epsilon then

96
00:03:07,930 --> 00:03:09,170
we flag this as an anomaly.

97
00:03:11,410 --> 00:03:12,260
So we see a new engine

98
00:03:12,660 --> 00:03:13,960
that, you know, has very low probability

99
00:03:14,850 --> 00:03:15,900
under a model p of

100
00:03:16,020 --> 00:03:17,130
x that we estimate from the data,

101
00:03:17,790 --> 00:03:19,370
then we flag this as an anomaly, whereas

102
00:03:19,730 --> 00:03:21,880
if p of x-test is, say,

103
00:03:22,320 --> 00:03:24,110
greater than or equal to some small threshold,

104
00:03:25,120 --> 00:03:26,620
then we say that, you know, it looks okay.
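
The rule just described can be sketched in a few lines. This is only a minimal illustration, not the algorithm from the lecture: it assumes p(x) is a product of independent per-feature Gaussians, and the names `fit_density`, `p`, `is_anomaly`, the engine measurements, and the value of `epsilon` are all made up for the example.

```python
import numpy as np

def fit_density(X):
    """Estimate a per-feature mean and variance from normal examples X (m x n)."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """Density of x under independent per-feature Gaussians (an assumed model)."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(coeff * np.exp(-((x - mu) ** 2) / (2.0 * var))))

def is_anomaly(x_test, mu, var, epsilon):
    """Flag x_test when p(x_test) < epsilon; otherwise treat it as okay."""
    return p(x_test, mu, var) < epsilon

# Hypothetical engine measurements: columns are heat and vibration.
X_train = np.array([[10.0, 5.0], [10.5, 5.2], [9.5, 4.8], [10.2, 5.1], [9.8, 4.9]])
mu, var = fit_density(X_train)
epsilon = 1e-3  # illustrative threshold, not a recommended value

near = is_anomaly(np.array([10.0, 5.0]), mu, var, epsilon)  # in the middle of the data
far = is_anomaly(np.array([14.0, 8.0]), mu, var, epsilon)   # far from every training point
```

Here `near` is the "looks okay" case and `far` is the one that would be sent for further testing.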

105
00:03:27,780 --> 00:03:28,740
And so, given the training set,

106
00:03:28,980 --> 00:03:30,890
like that plotted here, if

107
00:03:31,060 --> 00:03:31,940
you build a model, hopefully

108
00:03:32,560 --> 00:03:34,020
you will find that aircraft engines,

109
00:03:34,470 --> 00:03:35,500
or hopefully the model p of

110
00:03:35,560 --> 00:03:37,070
x will say that points that

111
00:03:37,260 --> 00:03:38,540
lie, you know, somewhere in the

112
00:03:38,580 --> 00:03:39,550
middle have pretty high probability,

113
00:03:40,720 --> 00:03:42,830
whereas points a little bit further out have lower probability.

114
00:03:43,850 --> 00:03:45,050
Points that are even further out

115
00:03:45,530 --> 00:03:47,220
have somewhat lower probability, and the

116
00:03:47,480 --> 00:03:48,420
point that's way out here,

117
00:03:49,080 --> 00:03:50,400
the point that's way

118
00:03:50,520 --> 00:03:52,100
out there, would be an anomaly.

119
00:03:54,150 --> 00:03:55,280
Whereas the point that's way in

120
00:03:55,470 --> 00:03:56,460
there, right in the

121
00:03:56,520 --> 00:03:57,720
middle, this would be

122
00:03:57,830 --> 00:03:59,080
okay because p of x

123
00:03:59,370 --> 00:04:00,300
right in the middle of that

124
00:04:00,460 --> 00:04:01,320
would be very high because we've

125
00:04:01,520 --> 00:04:03,320
seen a lot of points in that region.

126
00:04:04,620 --> 00:04:07,580
Here are some examples of applications of anomaly detection.

127
00:04:08,450 --> 00:04:09,990
Perhaps the most common application of

128
00:04:10,080 --> 00:04:11,420
anomaly detection is actually

129
00:04:11,560 --> 00:04:13,260
for fraud detection. If you

130
00:04:13,360 --> 00:04:14,820
have many users, and if

131
00:04:15,070 --> 00:04:16,360
each of your users takes different

132
00:04:16,670 --> 00:04:17,740
activities, you know maybe

133
00:04:17,920 --> 00:04:18,560
on your website or in the

134
00:04:18,630 --> 00:04:20,180
physical plant or something, you

135
00:04:20,300 --> 00:04:23,670
can compute features of the different users' activities.

136
00:04:24,830 --> 00:04:25,730
And what you can do is build

137
00:04:25,940 --> 00:04:27,240
a model to say, you know,

138
00:04:27,310 --> 00:04:28,960
what is the probability of different

139
00:04:29,170 --> 00:04:30,730
users behaving different ways.

140
00:04:30,890 --> 00:04:32,280
What is the probability of a particular vector

141
00:04:32,460 --> 00:04:34,590
of features of a

142
00:04:34,840 --> 00:04:36,750
user's behavior. So, you

143
00:04:36,900 --> 00:04:38,360
know examples of features of

144
00:04:38,450 --> 00:04:40,480
a user's activity, maybe on

145
00:04:40,650 --> 00:04:41,650
the website it'd be things like,

146
00:04:42,710 --> 00:04:44,350
maybe x1 is how often does

147
00:04:44,840 --> 00:04:46,460
this user log in, x2, you know, maybe

148
00:04:46,850 --> 00:04:47,920
the number of web

149
00:04:48,130 --> 00:04:49,330
pages visited, or the

150
00:04:49,730 --> 00:04:51,420
number of transactions, maybe x3

151
00:04:51,440 --> 00:04:52,820
is, you know, the number of

152
00:04:53,120 --> 00:04:53,990
posts of the user on the

153
00:04:54,130 --> 00:04:55,850
forum, feature x4 could

154
00:04:56,000 --> 00:04:56,910
be what is the typing

155
00:04:57,440 --> 00:04:58,660
speed of the user and some

156
00:04:58,920 --> 00:04:59,980
websites can actually track that

157
00:05:00,280 --> 00:05:01,410
what is the typing speed of this

158
00:05:01,600 --> 00:05:03,010
user in characters per second.

159
00:05:03,730 --> 00:05:06,610
And so you can model p of x based on this sort of data.

160
00:05:08,150 --> 00:05:09,140
And finally having your model

161
00:05:09,270 --> 00:05:10,530
p of x, you can

162
00:05:10,790 --> 00:05:12,570
try to identify users that

163
00:05:12,760 --> 00:05:14,210
are behaving very strangely on your

164
00:05:14,350 --> 00:05:15,590
website by checking which ones have

165
00:05:16,320 --> 00:05:18,100
p of x less than epsilon and

166
00:05:18,240 --> 00:05:21,140
maybe send the profiles of those users for further review.

167
00:05:22,330 --> 00:05:24,560
Or demand additional identification from

168
00:05:24,740 --> 00:05:26,190
those users, or some such

169
00:05:26,650 --> 00:05:28,370
to guard against you know,

170
00:05:29,200 --> 00:05:31,650
strange behavior or fraudulent behavior on your website.
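
As a rough sketch of that workflow, the snippet below scores a batch of users at once and returns the ones to review. Everything in it is an assumption for illustration: the synthetic feature values, the independent per-feature Gaussian model, the helper name `log_p`, and the threshold. It compares log-densities against log(epsilon), which is the same test as p(x) < epsilon but avoids numerical underflow when p(x) is extremely small.

```python
import numpy as np

# Hypothetical per-user features, one row per user:
# [logins per week, pages visited, forum posts, typing speed (chars/sec)]
rng = np.random.default_rng(0)
typical = rng.normal(loc=[5.0, 40.0, 2.0, 4.0],
                     scale=[1.0, 8.0, 1.0, 0.5],
                     size=(200, 4))
odd_user = np.array([[60.0, 900.0, 0.0, 25.0]])  # made-up extreme behavior
users = np.vstack([typical, odd_user])           # the odd user is row 200

# Fit the assumed model on the typical users only.
mu = typical.mean(axis=0)
var = typical.var(axis=0)

def log_p(X, mu, var):
    """Log density of each row under independent per-feature Gaussians."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                  - (X - mu) ** 2 / (2.0 * var), axis=1)

epsilon = 1e-12  # illustrative threshold, not a recommended value
# Row indices of users whose p(x) falls below epsilon.
flagged = np.flatnonzero(log_p(users, mu, var) < np.log(epsilon))
```

The indices in `flagged` are the profiles you would send for further review. Note that they mark unusual users, not necessarily fraudulent ones.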

171
00:05:33,030 --> 00:05:34,960
This sort of technique will tend

172
00:05:35,160 --> 00:05:36,470
to flag the users that are

173
00:05:36,720 --> 00:05:38,250
behaving unusually, not just

174
00:05:39,480 --> 00:05:41,420
users that may be behaving fraudulently.

175
00:05:42,190 --> 00:05:44,030
So not just accounts that have been

176
00:05:44,370 --> 00:05:45,670
stolen or users that are

177
00:05:45,780 --> 00:05:47,780
trying to do funny things; it just finds unusual users.

178
00:05:48,560 --> 00:05:49,770
But this is actually the technique

179
00:05:50,040 --> 00:05:51,430
that is used by many online

180
00:05:52,500 --> 00:05:53,570
websites that sell things to

181
00:05:53,750 --> 00:05:55,860
try to identify users behaving

182
00:05:56,240 --> 00:05:57,900
strangely that might be

183
00:05:58,040 --> 00:05:59,160
indicative of either fraudulent

184
00:05:59,760 --> 00:06:02,420
behavior or of computer accounts that have been stolen.

185
00:06:03,580 --> 00:06:06,410
Another example of anomaly detection is manufacturing.

186
00:06:07,180 --> 00:06:08,470
So, I already talked about the

187
00:06:08,530 --> 00:06:09,770
aircraft engine thing where you can

188
00:06:10,030 --> 00:06:11,460
find unusual, say, aircraft

189
00:06:11,900 --> 00:06:13,600
engines and send those for further review.

190
00:06:15,430 --> 00:06:16,740
A third application would be

191
00:06:17,070 --> 00:06:19,210
monitoring computers in a data center.

192
00:06:19,390 --> 00:06:20,410
I actually have some friends who work on this too.

193
00:06:21,260 --> 00:06:22,280
So if you have a lot

194
00:06:22,580 --> 00:06:23,550
of machines in a computer

195
00:06:23,730 --> 00:06:24,690
cluster or in a

196
00:06:24,780 --> 00:06:25,710
data center, we can do

197
00:06:25,920 --> 00:06:28,560
things like compute features at each machine.

198
00:06:29,020 --> 00:06:30,650
So maybe some features capturing

199
00:06:31,170 --> 00:06:32,730
you know, how much memory used, number of

200
00:06:32,870 --> 00:06:34,280
disk accesses, CPU load,

201
00:06:35,060 --> 00:06:36,050
as well as more complex features

202
00:06:36,440 --> 00:06:37,450
like what is the CPU

203
00:06:37,830 --> 00:06:39,650
load on this machine divided by

204
00:06:39,960 --> 00:06:41,340
the amount of network traffic

205
00:06:41,950 --> 00:06:43,050
on this machine?
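
A feature matrix like that might be assembled as follows. This is just a sketch: the function name `machine_features`, the sample numbers, and the small `eps` added to the denominator (to avoid dividing by zero on an idle network link) are assumptions, not something from the lecture.

```python
import numpy as np

def machine_features(memory_used, disk_accesses, cpu_load, network_traffic, eps=1e-9):
    """Stack raw per-machine metrics with a CPU-load-to-network-traffic ratio."""
    memory_used = np.asarray(memory_used, dtype=float)
    disk_accesses = np.asarray(disk_accesses, dtype=float)
    cpu_load = np.asarray(cpu_load, dtype=float)
    network_traffic = np.asarray(network_traffic, dtype=float)
    # Derived feature: large when a machine is busy but moving little traffic.
    ratio = cpu_load / (network_traffic + eps)
    return np.column_stack([memory_used, disk_accesses, cpu_load, ratio])

# Two hypothetical machines; the second has high CPU load with almost no traffic.
X = machine_features(memory_used=[4.0, 4.2],
                     disk_accesses=[120, 130],
                     cpu_load=[0.5, 0.9],
                     network_traffic=[50.0, 1.0])
```

Each row of `X` is one machine's feature vector x, and the last column is the derived ratio.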

206
00:06:43,340 --> 00:06:44,580
Then given the dataset of how

207
00:06:44,820 --> 00:06:45,780
your computers in your data

208
00:06:46,070 --> 00:06:47,230
center usually behave, you can

209
00:06:47,390 --> 00:06:48,460
model the probability of x,

210
00:06:48,590 --> 00:06:49,730
so you can model the probability

211
00:06:50,350 --> 00:06:51,840
of these machines having

212
00:06:52,840 --> 00:06:53,790
different amounts of memory use

213
00:06:54,060 --> 00:06:55,200
or probability of these machines having

214
00:06:55,920 --> 00:06:57,160
different numbers of disk accesses

215
00:06:57,780 --> 00:06:59,880
or different CPU loads and so on.

216
00:07:00,030 --> 00:07:01,100
And if you ever have a machine

217
00:07:02,030 --> 00:07:03,530
whose probability of x,

218
00:07:03,800 --> 00:07:05,330
p of x, is very small then you

219
00:07:05,440 --> 00:07:06,880
know that machine is behaving unusually

220
00:07:07,970 --> 00:07:08,950
and maybe that machine is

221
00:07:09,050 --> 00:07:11,630
about to go down, and you

222
00:07:11,700 --> 00:07:13,620
can flag that for review by a system administrator.

223
00:07:14,690 --> 00:07:15,890
And this is actually being used

224
00:07:16,060 --> 00:07:17,800
today by various data

225
00:07:18,020 --> 00:07:19,550
centers to watch out for unusual

226
00:07:20,040 --> 00:07:21,430
things happening on their machines.

227
00:07:22,920 --> 00:07:24,420
So, that's anomaly detection.

228
00:07:25,540 --> 00:07:26,880
In the next video, I'll

229
00:07:27,120 --> 00:07:29,400
talk a bit about the Gaussian distribution and

230
00:07:29,580 --> 00:07:31,030
review properties of the Gaussian

231
00:07:31,580 --> 00:07:33,540
probability distribution, and in

232
00:07:33,690 --> 00:07:34,650
videos after that, we will

233
00:07:34,790 --> 00:07:37,390
apply it to develop an anomaly detection algorithm.