1 00:00:00,507 --> 00:00:04,464 [MUSIC] 2 00:00:04,464 --> 00:00:07,723 So we've talked about using k-NN for regression, but 3 00:00:07,723 --> 00:00:12,920 these methods can also be very, very straightforwardly used for classification. 4 00:00:12,920 --> 00:00:16,590 So this is a little warm-up for the next course in this specialization, 5 00:00:16,590 --> 00:00:18,339 which is our classification course. 6 00:00:19,890 --> 00:00:23,437 And let's start out by just recalling our classification task, 7 00:00:23,437 --> 00:00:26,737 where we're gonna do this in the context of spam filtering. 8 00:00:26,737 --> 00:00:30,142 Where we have some email as our input, and 9 00:00:30,142 --> 00:00:35,016 the output is gonna be whether the email is spam or not spam. 10 00:00:35,016 --> 00:00:39,160 And we're gonna make this decision based on the text of the email, 11 00:00:39,160 --> 00:00:42,538 maybe information about the sender IP, and things like this. 12 00:00:42,538 --> 00:00:48,220 Well, what we can do is use k-NN for classification. 13 00:00:48,220 --> 00:00:52,170 Visually we can think about just taking all of the emails that we have labeled as 14 00:00:52,170 --> 00:00:56,770 spam or not spam and throwing them down in some space. 15 00:00:56,770 --> 00:01:01,090 Where the distance between emails in this space 16 00:01:01,090 --> 00:01:04,730 represents how similar the text or the sender IP information is. 17 00:01:04,730 --> 00:01:09,220 All the inputs or the features we're using to represent these emails. 18 00:01:09,220 --> 00:01:14,120 And then what we can do, is we get some query email that comes in. 19 00:01:14,120 --> 00:01:16,649 So, that's this little gray email here. 20 00:01:16,649 --> 00:01:20,330 And we're gonna say, is it spam or not spam? 21 00:01:20,330 --> 00:01:23,778 There's a very intuitive way to do this, which is just search for 22 00:01:23,778 --> 00:01:25,796 the nearest neighbors of this email. 
23 00:01:25,796 --> 00:01:27,897 The emails most similar to this email. 24 00:01:27,897 --> 00:01:30,918 And then we're just gonna do a majority vote on 25 00:01:30,918 --> 00:01:36,340 the nearest neighbors to decide whether this email is spam or not spam. 26 00:01:36,340 --> 00:01:41,514 And what we see in this case, is that four of the neighbors are spam, and 27 00:01:41,514 --> 00:01:46,791 only one neighbor is not spam, so we're gonna label this email as spam. 28 00:01:46,791 --> 00:01:51,446 And so this is the really, really straightforward approach of using k-NN for 29 00:01:51,446 --> 00:01:52,585 classification. 30 00:01:52,585 --> 00:01:57,519 [MUSIC]
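
As a rough sketch of the majority vote described above: the snippet below assumes each email has already been turned into a numeric feature vector and that Euclidean distance is the similarity measure. Neither of those choices comes from the lecture. The function name `knn_classify`, the two-dimensional feature values, and the choice of k = 5 are all hypothetical, picked only so that the query ends up with four spam neighbors and one not-spam neighbor, as in the example.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=5):
    """Label `query` by majority vote among its k nearest labeled examples.

    `examples` is a list of (feature_vector, label) pairs. Euclidean
    distance is an assumption here; any distance on the features would do.
    """
    # Find the k labeled examples closest to the query.
    neighbors = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D features for already-labeled emails (illustrative only).
labeled_emails = [
    ((0.90, 0.80), "spam"),
    ((0.80, 0.90), "spam"),
    ((0.70, 0.70), "spam"),
    ((0.95, 0.60), "spam"),
    ((0.60, 0.50), "not spam"),
    ((0.10, 0.20), "not spam"),
    ((0.20, 0.10), "not spam"),
]

# The "gray" query email: its 5 nearest neighbors are 4 spam and 1 not spam.
query_email = (0.80, 0.70)
print(knn_classify(query_email, labeled_emails, k=5))
```

With two classes, an odd k avoids tied votes. In this sketch a tie would go to whichever label appears first among the sorted neighbors, which is just a side effect of how `Counter` orders equal counts, not a deliberate rule.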