You've now learned so much about deep learning and sequence models that we can actually describe a trigger word detection system quite simply, in just one slide, as you'll see in this video. With the rise of speech recognition, there have been more and more devices you can wake up with your voice, and those are sometimes called trigger word detection systems. So, let's see how you can build a trigger word system. Examples of trigger word systems include Amazon Echo, which is woken up with the word "Alexa"; Baidu DuerOS-powered devices, woken up with the phrase "Xiaodu nihao"; Apple Siri, woken up with "Hey Siri"; and Google Home, woken up with "Okay Google". The idea of trigger word detection is that if you have, say, an Amazon Echo in your living room, you can walk through the living room and just say, "Alexa, what time is it?", and have it wake up, or be triggered by, the word "Alexa" and answer your voice query. So, if you can build a trigger word detection system, maybe you can make your computer do something by telling it "activate". One of my friends also works on turning a particular lamp on and off using a trigger word, as a fun project. But what I want to show you is how you can build a trigger word detection system.

The literature on trigger word detection algorithms is still evolving, so there isn't wide consensus yet on what the best algorithm for trigger word detection is. So, I'm just going to show you one example of an algorithm you can use. Now, you've seen RNNs like this, and what we really do is take an audio clip, maybe compute spectrogram features, and that generates audio features x<1>, x<2>, x<3>, which you pass through an RNN. So, all that remains to be done is to define the target labels y. Say this point in the audio clip is when someone has just finished saying the trigger word, such as "Alexa" or "Xiaodu nihao", or "Hey Siri", or "Okay Google".
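To make the picture concrete, here is a minimal sketch of such a network in Keras. This is an illustration rather than the exact model from the slide: the layer sizes, the 1D convolution over the spectrogram frames, and the dimensions Tx and n_freq are all assumed values.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, GRU, Dense, TimeDistributed,
                                     BatchNormalization, Activation)

# Assumed dimensions: Tx spectrogram time steps, each with n_freq frequency bins.
Tx, n_freq = 5511, 101

model = Sequential([
    # A 1D convolution over time extracts low-level audio features and
    # shortens the sequence the recurrent layer has to process.
    Conv1D(filters=196, kernel_size=15, strides=4, input_shape=(Tx, n_freq)),
    BatchNormalization(),
    Activation("relu"),
    # The RNN consumes the feature sequence x<1>, x<2>, ...
    GRU(units=128, return_sequences=True),
    # One sigmoid output y<t> per time step: the probability that the
    # trigger word was just said.
    TimeDistributed(Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because every time step emits its own prediction, a per-step binary cross-entropy loss matches the labeling scheme described next.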
Then, in the training set, you can set the target labels to be zero for everything before that point, and right after that, set the target label to one. Then, if a little bit later the trigger word is said again at a second point, you can again set the target label to one right after that point. Now, this type of labeling scheme for an RNN could work; actually, it works reasonably well. One slight disadvantage, though, is that it creates a very imbalanced training set: you have a lot more zeros than ones. So, one other thing you could do, which is a little bit of a hack but can make the model a little bit easier to train, is instead of setting the output to one at only a single time step, have it output one at several consecutive time steps, for a fixed period of time, before reverting back to zero; there's a small sketch of this labeling scheme below. That slightly evens out the ratio of ones to zeros, but it is a little bit of a hack. So, if this is the moment in the audio clip when the trigger word was said, then right after that you set the target label to one for a short stretch; and if this is when the trigger word is said again, then right after that is when you want the RNN to output one again.

So, you'll get to play more with this in the programming exercise, but I think you should feel quite proud of yourself. We've learned enough about deep learning that it just takes one picture, one slide, to describe something as complicated as trigger word detection. Based on this, I hope you'll implement something that works and allows you to detect trigger words. Well, you'll see more of this in the programming exercise. So, that's it for trigger words. I hope you feel quite proud of yourself for how much you've learned about deep learning, that you can now describe trigger word detection in just one slide in a few minutes, and that you're hopefully prepared to implement it and get it to work. Maybe you can even make it do something fun in your house, like turn a lamp on or off, or do something on your computer when you or someone else says the trigger word.
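As a concrete illustration of that labeling scheme (my own sketch, not code from the lecture), here is how you might build the target vector y with NumPy: zeros everywhere, then a fixed-width run of ones right after each point where the trigger word ends. The output length Ty and the width of 50 steps are assumed values.

```python
import numpy as np

def make_labels(Ty, trigger_end_steps, ones_width=50):
    """Per-time-step targets: 0 everywhere, except a fixed-width run of 1s
    immediately after each time step at which the trigger word ends."""
    y = np.zeros(Ty)
    for t_end in trigger_end_steps:
        # The "hack": label ones_width consecutive steps as 1 (instead of
        # a single step), clipping at the end of the clip.
        y[t_end + 1 : min(t_end + 1 + ones_width, Ty)] = 1.0
    return y

# Example: the trigger word ends at steps 400 and 900 of a 1375-step output.
y = make_labels(Ty=1375, trigger_end_steps=[400, 900])
print(int(y.sum()))  # 100 -> two runs of 50 ones each
```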
This is the last technical video of this course. To wrap up this course on sequence models: in the first week, you learned about RNNs, including both GRUs and LSTMs. Then in the second week, you learned a lot about word embeddings and how to learn representations of words. Then in this week, you learned about the attention model, as well as how to use it to process audio data. I hope you have fun implementing all of these ideas in this week's programming exercises. Let's go on to the last video.