1
00:00:00,507 --> 00:00:04,464
[MUSIC]

2
00:00:04,464 --> 00:00:07,723
So we've talked about using k-NN for
regression, but

3
00:00:07,723 --> 00:00:12,920
these methods can also be very, very
straightforwardly used for classification.

4
00:00:12,920 --> 00:00:16,590
So this is a little warm-up for
the next course in this specialization,

5
00:00:16,590 --> 00:00:18,339
which is our classification course.

6
00:00:19,890 --> 00:00:23,437
And let's start out by just
recalling our classification task,

7
00:00:23,437 --> 00:00:26,737
where we're gonna do this in
the context of spam filtering.

8
00:00:26,737 --> 00:00:30,142
Where we have some email as our input, and

9
00:00:30,142 --> 00:00:35,016
the output is gonna be whether
the email is spam or not spam.

10
00:00:35,016 --> 00:00:39,160
And we're gonna make this decision
based on the text of the email.

11
00:00:39,160 --> 00:00:42,538
Maybe information about the sender IP
and things like this.

12
00:00:42,538 --> 00:00:48,220
Well, what we can do is use k-NN for
classification.

13
00:00:48,220 --> 00:00:52,170
Visually we can think about just taking
all of the emails that we have labeled as

14
00:00:52,170 --> 00:00:56,770
spam or not spam and
throwing them down in some space.

15
00:00:56,770 --> 00:01:01,090
Where the distance between
emails in this space

16
00:01:01,090 --> 00:01:04,730
represents how similar the text or
the sender IP information is.

17
00:01:04,730 --> 00:01:09,220
All the inputs or the features we're
using to represent these emails.

18
00:01:09,220 --> 00:01:14,120
And then what happens is
we get some query email that comes in.

19
00:01:14,120 --> 00:01:16,649
So, that's this little gray email here.

20
00:01:16,649 --> 00:01:20,330
And we're gonna say,
is it spam or not spam?

21
00:01:20,330 --> 00:01:23,778
There's a very intuitive way to do this,
which is just search for

22
00:01:23,778 --> 00:01:25,796
the nearest neighbors of this email.

23
00:01:25,796 --> 00:01:27,897
The emails most similar to this email.

24
00:01:27,897 --> 00:01:30,918
And then we're just gonna
do a majority vote on

25
00:01:30,918 --> 00:01:36,340
the nearest neighbors to decide whether
this email is spam or not spam.

26
00:01:36,340 --> 00:01:41,514
And what we see in this case is that
four of the five nearest neighbors are spam, and

27
00:01:41,514 --> 00:01:46,791
only one neighbor is not spam, so
we're gonna label this email as spam.

28
00:01:46,791 --> 00:01:51,446
And so this is the really, really
straightforward approach of using k-NN for

29
00:01:51,446 --> 00:01:52,585
classification.

30
00:01:52,585 --> 00:01:57,519
[MUSIC]
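
The majority vote described in this segment can be sketched in a few lines of Python. This is only a minimal illustration, not the course's implementation: the function name `knn_classify`, the two-dimensional feature vectors, and the choice of Euclidean distance are all assumptions made for the example. Real emails would first need to be turned into numeric features (from the text, the sender IP, and so on).

```python
import math
from collections import Counter


def knn_classify(query, examples, k=5):
    """Label `query` by majority vote among its k nearest labeled examples.

    `examples` is a list of (feature_vector, label) pairs. Euclidean
    distance is an assumption here; any distance over the features works.
    """
    # Sort the labeled examples by their distance to the query point.
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors standing in for labeled emails.
labeled_emails = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.8), "spam"),
    ((0.9, 1.3), "spam"),
    ((1.4, 1.1), "spam"),
    ((1.6, 1.6), "not spam"),
    ((5.0, 5.0), "not spam"),
    ((5.5, 4.5), "not spam"),
]

# The query email: its five nearest neighbors are four spam, one not spam.
query_email = (1.1, 1.1)
prediction = knn_classify(query_email, labeled_emails, k=5)
print(prediction)
```

With two classes, choosing an odd k avoids tied votes. In this sketch a tie would simply go to whichever tied label appears first among the nearest neighbors.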