1 00:00:01,180 --> 00:00:02,410 In the next few videos I'd 2 00:00:02,560 --> 00:00:04,780 like to talk about machine learning system design. 3 00:00:05,780 --> 00:00:06,950 These videos will touch on 4 00:00:07,190 --> 00:00:08,370 the main issues that you may 5 00:00:08,540 --> 00:00:10,140 face when designing a 6 00:00:10,220 --> 00:00:11,450 complex machine learning system, 7 00:00:12,470 --> 00:00:13,310 and will actually try to give 8 00:00:13,490 --> 00:00:14,680 advice on how to 9 00:00:14,780 --> 00:00:17,580 strategize putting together a complex machine learning system. 10 00:00:18,970 --> 00:00:20,190 In case this next set 11 00:00:20,340 --> 00:00:21,390 of videos seems a little 12 00:00:21,530 --> 00:00:23,140 disjointed that's because these 13 00:00:23,360 --> 00:00:24,340 videos will touch on a 14 00:00:24,450 --> 00:00:25,800 range of the different issues that 15 00:00:26,150 --> 00:00:28,220 you may come across when designing complex learning systems. 16 00:00:29,600 --> 00:00:31,080 And even though the 17 00:00:31,160 --> 00:00:32,270 next set of videos may seem 18 00:00:32,560 --> 00:00:34,740 somewhat less mathematical, I think 19 00:00:35,050 --> 00:00:36,180 that this material may turn 20 00:00:36,500 --> 00:00:38,280 out to be very useful, and 21 00:00:38,400 --> 00:00:39,650 potentially huge time savers 22 00:00:40,120 --> 00:00:41,610 when you're building big machine learning systems. 23 00:00:42,890 --> 00:00:44,140 Concretely, I'd like to 24 00:00:44,260 --> 00:00:45,710 begin with the issue of 25 00:00:46,330 --> 00:00:47,500 prioritizing how to spend 26 00:00:47,790 --> 00:00:48,680 your time on what to work 27 00:00:48,980 --> 00:00:50,330 on, and I'll begin 28 00:00:50,740 --> 00:00:52,220 with an example on spam classification. 29 00:00:55,580 --> 00:00:57,280 Let's say you want to build a spam classifier. 30 00:00:58,540 --> 00:00:59,740 Here are a couple of examples 31 00:01:00,180 --> 00:01:02,340 of obvious spam and non-spam emails. 32 00:01:03,400 --> 00:01:05,350 if the one on the left tried to sell things. 33 00:01:06,270 --> 00:01:07,640 And notice how spammers 34 00:01:08,470 --> 00:01:10,080 will deliberately misspell words, 35 00:01:10,540 --> 00:01:13,470 like Vincent with a 1 there, and mortgages. 36 00:01:14,850 --> 00:01:16,350 And on the right as maybe 37 00:01:16,560 --> 00:01:17,760 an obvious example of non-stamp 38 00:01:18,480 --> 00:01:20,680 email, actually email from my younger brother. 39 00:01:21,710 --> 00:01:22,940 Let's say we have a labeled 40 00:01:23,350 --> 00:01:24,560 training set of some 41 00:01:24,860 --> 00:01:26,130 number of spam emails and 42 00:01:26,240 --> 00:01:28,200 some non-spam emails denoted 43 00:01:28,240 --> 00:01:30,780 with labels y equals 1 or 0, 44 00:01:31,290 --> 00:01:32,600 how do we build a 45 00:01:33,110 --> 00:01:34,900 classifier using supervised learning 46 00:01:35,230 --> 00:01:37,130 to distinguish between spam and non-spam? 47 00:01:38,130 --> 00:01:39,670 In order to apply supervised learning, 48 00:01:40,340 --> 00:01:41,430 the first decision we must 49 00:01:41,660 --> 00:01:43,190 make is how do 50 00:01:43,270 --> 00:01:44,860 we want to represent x, that 51 00:01:45,260 --> 00:01:46,590 is the features of the email. 52 00:01:47,430 --> 00:01:48,900 Given the features x and 53 00:01:49,160 --> 00:01:50,290 the labels y in our 54 00:01:50,410 --> 00:01:51,510 training set, we can then 55 00:01:51,720 --> 00:01:54,660 train a classifier, for example using logistic regression. 56 00:01:56,150 --> 00:01:57,120 Here's one way to choose 57 00:01:57,550 --> 00:01:59,630 a set of features for our emails. 58 00:02:00,850 --> 00:02:01,930 We could come up with, 59 00:02:02,280 --> 00:02:03,630 say, a list of maybe 60 00:02:03,870 --> 00:02:05,170 a hundred words that we think 61 00:02:05,440 --> 00:02:06,850 are indicative of whether e-mail 62 00:02:07,190 --> 00:02:09,230 is spam or non-spam, for 63 00:02:09,370 --> 00:02:10,410 example, if a piece of 64 00:02:10,580 --> 00:02:11,640 e-mail contains the word 'deal' 65 00:02:12,340 --> 00:02:13,350 maybe it's more likely to be 66 00:02:13,460 --> 00:02:14,410 spam if it contains 67 00:02:14,850 --> 00:02:16,280 the word 'buy' maybe more 68 00:02:16,450 --> 00:02:17,670 likely to be spam, a 69 00:02:17,990 --> 00:02:19,340 word like 'discount' is more 70 00:02:19,580 --> 00:02:20,900 likely to be spam, whereas if 71 00:02:21,080 --> 00:02:22,340 a piece of email contains my name, 72 00:02:23,920 --> 00:02:25,350 Andrew, maybe that means 73 00:02:25,630 --> 00:02:26,870 the person actually knows who 74 00:02:26,910 --> 00:02:27,740 I am and that might mean it's 75 00:02:27,900 --> 00:02:30,090 less likely to be spam. 76 00:02:31,470 --> 00:02:32,580 And maybe for some reason I think 77 00:02:32,840 --> 00:02:33,990 the word "now" may be 78 00:02:34,260 --> 00:02:35,680 indicative of non-spam because 79 00:02:35,980 --> 00:02:37,150 I get a lot of urgent 80 00:02:37,540 --> 00:02:39,370 emails, and so on, 81 00:02:39,520 --> 00:02:41,220 and maybe we choose a hundred words or so. 82 00:02:42,380 --> 00:02:43,510 Given a piece of email, 83 00:02:43,970 --> 00:02:44,930 we can then take this piece 84 00:02:45,180 --> 00:02:46,220 of email and encode 85 00:02:46,640 --> 00:02:47,970 it into a feature 86 00:02:48,290 --> 00:02:49,930 vector as follows. 87 00:02:50,810 --> 00:02:51,450 I'm going to take my list of a 88 00:02:51,720 --> 00:02:54,560 hundred words and sort 89 00:02:54,960 --> 00:02:56,620 them in alphabetical order say. 90 00:02:57,210 --> 00:02:57,980 It doesn't have to be sorted. 91 00:02:58,450 --> 00:02:59,910 But, you know, here's a, here's 92 00:03:00,110 --> 00:03:02,080 my list of words, just count 93 00:03:02,710 --> 00:03:03,950 and so on, until eventually 94 00:03:04,160 --> 00:03:05,430 I'll get down to now, and so 95 00:03:06,080 --> 00:03:07,230 on and given a piece 96 00:03:07,350 --> 00:03:08,540 of e-mail like that shown on the 97 00:03:08,610 --> 00:03:09,640 right, I'm going to 98 00:03:09,770 --> 00:03:11,400 check and see whether or 99 00:03:11,450 --> 00:03:12,560 not each of these words 100 00:03:13,030 --> 00:03:14,560 appears in the e-mail and then 101 00:03:14,810 --> 00:03:16,400 I'm going to define a feature 102 00:03:16,580 --> 00:03:19,130 vector x where in 103 00:03:19,260 --> 00:03:20,260 this piece of an email on 104 00:03:20,340 --> 00:03:21,520 the right, my name doesn't 105 00:03:21,930 --> 00:03:23,210 appear so I'm gonna put a zero there. 106 00:03:24,070 --> 00:03:25,410 The word "by" does appear, 107 00:03:26,790 --> 00:03:27,690 so I'm gonna put a one there 108 00:03:28,090 --> 00:03:29,450 and I'm just gonna put one's or zeroes. 109 00:03:30,170 --> 00:03:31,550 I'm gonna put a 110 00:03:31,730 --> 00:03:33,950 one even though the word "by" occurs twice. 111 00:03:34,600 --> 00:03:36,490 I'm not gonna recount how many times the word occurs. 112 00:03:37,590 --> 00:03:40,280 The word "due" appears, I put a one there. 113 00:03:40,900 --> 00:03:42,450 The word "discount" doesn't appear, at 114 00:03:42,620 --> 00:03:43,680 least not in this this little 115 00:03:44,520 --> 00:03:46,140 short email, and so on. 116 00:03:46,810 --> 00:03:48,740 The word "now" does appear and so on. 117 00:03:48,870 --> 00:03:50,250 So I put ones and zeroes 118 00:03:50,560 --> 00:03:52,560 in this feature vector depending on 119 00:03:52,720 --> 00:03:54,230 whether or not a particular word appears. 120 00:03:55,060 --> 00:03:56,740 And in this example my 121 00:03:56,870 --> 00:03:58,850 feature vector would have 122 00:03:59,110 --> 00:04:00,920 to mention one hundred, 123 00:04:02,310 --> 00:04:03,960 if I have a hundred, 124 00:04:04,310 --> 00:04:05,460 if if I chose a hundred 125 00:04:05,650 --> 00:04:06,850 words to use for 126 00:04:07,010 --> 00:04:08,980 this representation and each 127 00:04:09,240 --> 00:04:13,060 of my features Xj will 128 00:04:13,300 --> 00:04:15,150 basically be 1 if 129 00:04:16,360 --> 00:04:17,410 you have a particular word that, 130 00:04:17,490 --> 00:04:18,930 we'll call this word j, appears 131 00:04:19,420 --> 00:04:20,940 in the email and Xj 132 00:04:22,400 --> 00:04:23,910 would be zero otherwise. 133 00:04:25,700 --> 00:04:25,700 Okay. 134 00:04:25,900 --> 00:04:27,440 So that gives me 135 00:04:27,600 --> 00:04:30,220 a feature representation of a piece of email. 136 00:04:30,920 --> 00:04:32,060 By the way, even though I've 137 00:04:32,200 --> 00:04:34,260 described this process as manually 138 00:04:34,920 --> 00:04:36,790 picking a hundred words, in 139 00:04:36,950 --> 00:04:38,510 practice what's most commonly 140 00:04:38,940 --> 00:04:40,140 done is to look through 141 00:04:40,400 --> 00:04:42,710 a training set, and in 142 00:04:42,800 --> 00:04:43,980 the training set depict the 143 00:04:44,050 --> 00:04:45,690 most frequently occurring n words 144 00:04:46,080 --> 00:04:47,290 where n is usually between ten 145 00:04:47,450 --> 00:04:49,310 thousand and fifty thousand, and use 146 00:04:49,550 --> 00:04:50,810 those as your features. 147 00:04:51,630 --> 00:04:52,910 So rather than manually picking a 148 00:04:53,090 --> 00:04:54,220 hundred words, here you look 149 00:04:54,390 --> 00:04:56,030 through the training examples and 150 00:04:56,130 --> 00:04:57,570 pick the most frequently occurring words 151 00:04:57,930 --> 00:04:58,860 like ten thousand to fifty thousand 152 00:04:59,260 --> 00:05:00,680 words, and those form the 153 00:05:00,820 --> 00:05:01,550 features that you are going 154 00:05:01,640 --> 00:05:04,320 to use to represent your email for spam classification. 155 00:05:05,450 --> 00:05:06,850 Now, if you're building a 156 00:05:06,910 --> 00:05:09,020 spam classifier one question 157 00:05:09,570 --> 00:05:11,260 that you may face is, what's 158 00:05:11,500 --> 00:05:12,580 the best use of your time 159 00:05:13,230 --> 00:05:14,820 in order to make your 160 00:05:14,980 --> 00:05:17,510 spam classifier have higher accuracy, you have lower error. 161 00:05:18,910 --> 00:05:21,350 One natural inclination is going to collect lots of data. 162 00:05:21,780 --> 00:05:21,780 Right? 163 00:05:22,030 --> 00:05:23,120 And in fact there's this tendency 164 00:05:23,700 --> 00:05:24,670 to think that, well the 165 00:05:24,740 --> 00:05:26,590 more data we have the better the algorithm will do. 166 00:05:27,560 --> 00:05:28,850 And in fact, in the email 167 00:05:29,100 --> 00:05:30,500 spam domain, there are actually 168 00:05:31,310 --> 00:05:32,840 pretty serious projects called Honey 169 00:05:33,180 --> 00:05:35,310 Pot Projects, which create fake 170 00:05:35,710 --> 00:05:37,090 email addresses and try to 171 00:05:37,200 --> 00:05:38,910 get these fake email addresses into 172 00:05:39,140 --> 00:05:40,710 the hands of spammers and use 173 00:05:40,910 --> 00:05:41,870 that to try to collect tons 174 00:05:42,140 --> 00:05:43,440 of spam email, and therefore 175 00:05:44,120 --> 00:05:46,120 you know, get a lot of spam data to train learning algorithms. 176 00:05:47,060 --> 00:05:48,760 But we've already seen in the 177 00:05:49,150 --> 00:05:50,630 previous sets of videos 178 00:05:50,650 --> 00:05:53,340 that getting lots of data will often help, but not all the time. 179 00:05:54,600 --> 00:05:55,810 But for most machine learning problems, 180 00:05:56,430 --> 00:05:57,280 there are a lot of other things 181 00:05:57,620 --> 00:05:59,780 you could usually imagine doing to improve performance. 182 00:06:00,970 --> 00:06:02,130 For spam, one thing you 183 00:06:02,230 --> 00:06:03,420 might think of is to develop 184 00:06:03,960 --> 00:06:05,620 more sophisticated features on the 185 00:06:05,820 --> 00:06:08,070 email, maybe based on the email routing information. 186 00:06:09,850 --> 00:06:11,920 And this would be information contained in the email header. 187 00:06:13,130 --> 00:06:14,820 So, when spammers send email, 188 00:06:15,250 --> 00:06:16,420 very often they will try 189 00:06:16,690 --> 00:06:18,110 to obscure the origins of 190 00:06:18,340 --> 00:06:20,260 the email, and maybe 191 00:06:20,470 --> 00:06:21,880 use fake email headers. 192 00:06:22,900 --> 00:06:24,060 Or send email through very 193 00:06:24,410 --> 00:06:26,360 unusual sets of computer service. 194 00:06:27,060 --> 00:06:29,880 Through very unusual routes, in order to get the spam to you. 195 00:06:30,490 --> 00:06:33,690 And some of this information will be reflected in the email header. 196 00:06:35,000 --> 00:06:36,600 And so one can imagine, 197 00:06:38,070 --> 00:06:39,300 looking at the email headers and 198 00:06:39,410 --> 00:06:41,060 trying to develop more sophisticated features 199 00:06:41,510 --> 00:06:42,760 to capture this sort of 200 00:06:43,010 --> 00:06:45,770 email routing information to identify if something is spam. 201 00:06:46,650 --> 00:06:47,890 Something else you might consider doing 202 00:06:48,380 --> 00:06:49,300 is to look at the 203 00:06:49,640 --> 00:06:50,860 email message body, that is 204 00:06:51,100 --> 00:06:54,350 the email text, and try to develop more sophisticated features. 205 00:06:55,320 --> 00:06:56,310 For example, should the word 206 00:06:56,500 --> 00:06:57,560 'discount' and the word 207 00:06:57,690 --> 00:06:59,340 'discounts' be treated as 208 00:06:59,550 --> 00:07:01,810 the same words or should 209 00:07:02,240 --> 00:07:04,120 we have treat the words 'deal' and 'dealer' as the same word? 210 00:07:04,380 --> 00:07:05,610 Maybe even though one is 211 00:07:06,130 --> 00:07:08,020 lower case and one in capitalized in this example. 212 00:07:08,740 --> 00:07:10,530 Or do we want more complex features about punctuation because maybe spam 213 00:07:12,740 --> 00:07:14,110 is using exclamation marks a lot more. 214 00:07:14,450 --> 00:07:14,730 I don't know. 215 00:07:15,580 --> 00:07:16,850 And along the same lines, maybe 216 00:07:17,170 --> 00:07:18,560 we also want to develop more 217 00:07:18,750 --> 00:07:20,380 sophisticated algorithms to detect 218 00:07:21,120 --> 00:07:22,700 and maybe to correct to deliberate misspellings, 219 00:07:23,360 --> 00:07:24,700 like mortgage, medicine, watches. 220 00:07:25,700 --> 00:07:26,890 Because spammers actually do this, 221 00:07:27,150 --> 00:07:28,400 because if you have watches 222 00:07:29,420 --> 00:07:31,060 with a 4 in there then well, 223 00:07:31,450 --> 00:07:32,720 with the simple technique that we 224 00:07:32,840 --> 00:07:34,760 talked about just now, the spam 225 00:07:35,090 --> 00:07:36,280 classifier might not equate 226 00:07:36,800 --> 00:07:38,170 this as the same thing as the 227 00:07:38,230 --> 00:07:40,260 word "watches," and so it 228 00:07:40,390 --> 00:07:41,430 may have a harder time realizing 229 00:07:42,000 --> 00:07:43,930 that something is spam with these deliberate misspellings. 230 00:07:44,830 --> 00:07:45,940 And this is why spammers do it. 231 00:07:48,230 --> 00:07:49,280 While working on a machine learning 232 00:07:49,630 --> 00:07:51,370 problem, very often you 233 00:07:51,480 --> 00:07:54,690 can brainstorm lists of different things to try, like these. 234 00:07:55,170 --> 00:07:56,560 By the way, I've actually 235 00:07:56,790 --> 00:07:58,480 worked on the spam 236 00:07:58,900 --> 00:08:00,000 problem myself for a while. 237 00:08:00,650 --> 00:08:01,610 And I actually spent quite some time on it. 238 00:08:01,770 --> 00:08:03,040 And even though I kind 239 00:08:03,360 --> 00:08:04,350 of understand the spam problem, 240 00:08:04,820 --> 00:08:05,820 I actually know a bit about it, 241 00:08:06,470 --> 00:08:07,380 I would actually have a very 242 00:08:07,600 --> 00:08:09,160 hard time telling you of 243 00:08:09,290 --> 00:08:10,790 these four options which is 244 00:08:10,980 --> 00:08:12,190 the best use of your time 245 00:08:12,670 --> 00:08:14,180 so what happens, frankly what 246 00:08:14,320 --> 00:08:15,790 happens far too often is 247 00:08:16,040 --> 00:08:17,240 that a research group or 248 00:08:17,350 --> 00:08:20,330 product group will randomly fixate on one of these options. 249 00:08:21,290 --> 00:08:22,870 And sometimes that turns 250 00:08:23,250 --> 00:08:24,350 out not to be the most 251 00:08:24,580 --> 00:08:25,610 fruitful way to spend your 252 00:08:25,740 --> 00:08:27,700 time depending, you know, on which 253 00:08:27,900 --> 00:08:30,400 of these options someone ends up randomly fixating on. 254 00:08:31,350 --> 00:08:32,670 By the way, in fact, if 255 00:08:32,800 --> 00:08:33,780 you even get to the stage 256 00:08:34,150 --> 00:08:35,710 where you brainstorm a list 257 00:08:35,900 --> 00:08:37,100 of different options to try, you're 258 00:08:37,250 --> 00:08:38,740 probably already ahead of the curve. 259 00:08:39,390 --> 00:08:41,190 Sadly, what most people do is 260 00:08:41,420 --> 00:08:42,160 instead of trying to list 261 00:08:42,230 --> 00:08:43,010 out the options of things 262 00:08:43,240 --> 00:08:44,510 you might try, what far too 263 00:08:44,810 --> 00:08:46,100 many people do is wake up 264 00:08:46,210 --> 00:08:47,380 one morning and, for some 265 00:08:47,580 --> 00:08:48,850 reason, just, you know, have a weird 266 00:08:49,110 --> 00:08:50,440 gut feeling that, "Oh let's 267 00:08:51,290 --> 00:08:52,670 have a huge honeypot project 268 00:08:53,190 --> 00:08:54,570 to go and collect tons more data" 269 00:08:55,320 --> 00:08:56,860 and for whatever strange reason just 270 00:08:57,570 --> 00:08:58,540 sort of wake up one morning and randomly 271 00:08:59,050 --> 00:09:00,330 fixate on one thing and just 272 00:09:00,540 --> 00:09:02,340 work on that for six months. 273 00:09:03,520 --> 00:09:04,170 But I think we can do better. 274 00:09:04,760 --> 00:09:06,130 And in particular what I'd 275 00:09:06,270 --> 00:09:07,130 like to do in the next 276 00:09:07,310 --> 00:09:08,410 video is tell you about 277 00:09:08,680 --> 00:09:09,890 the concept of error analysis 278 00:09:11,160 --> 00:09:12,530 and talk about the way 279 00:09:13,270 --> 00:09:15,150 where you can try 280 00:09:15,360 --> 00:09:16,830 to have a more systematic way 281 00:09:17,360 --> 00:09:18,640 to choose amongst the options 282 00:09:18,960 --> 00:09:19,950 of the many different things you 283 00:09:20,010 --> 00:09:21,730 might work, and therefore be 284 00:09:21,860 --> 00:09:23,430 more likely to select what 285 00:09:23,640 --> 00:09:24,820 is actually a good way 286 00:09:25,070 --> 00:09:26,070 to spend your time, you know 287 00:09:26,200 --> 00:09:28,920 for the next few weeks, or next few days or the next few months.