Hello, and welcome to this final week of this course, as well as to the final week of this sequence of five courses in the deep learning specialization. You're nearly at the finish line. In this week, you'll hear about sequence-to-sequence models, which are useful for everything from machine translation to speech recognition. We'll start with the basic models, and then later this week you'll hear about beam search and the attention model, and we'll wrap up with a discussion of models for audio data, like speech. Let's get started.

Let's say you want to input a French sentence like "Jane visite l'Afrique en septembre," and you want to translate it to the English sentence, "Jane is visiting Africa in September." As usual, let's use x<1> through x, in this case <5>, to represent the words in the input sequence, and we'll use y<1> through y<6> to represent the words in the output sequence. So, how can you train a neural network to input the sequence x and output the sequence y?

Well, here's something you could do, and the ideas I'm about to present are mainly from these two papers, one by Ilya Sutskever, Oriol Vinyals, and Quoc Le, and the other by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. First, let's have a network, which we're going to call the encoder network, be built as an RNN, and this could be a GRU or an LSTM. Feed in the input French words one word at a time. After ingesting the input sequence, the RNN then outputs a vector that represents the input sentence. After that, you can build a decoder network, which I'm going to draw here, which takes as input the encoding output by the encoder network shown in black on the left, and can then be trained to output the translation one word at a time, until eventually it outputs, say, the end-of-sequence or end-of-sentence token, upon which the decoder stops. And as usual, we could take the generated tokens and feed them to the next time step in the sequence, like we were doing before when synthesizing text using the language model.
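To make the encoder-decoder idea concrete, here is a minimal sketch in PyTorch. The vocabulary sizes, embedding and hidden dimensions, and the use of a GRU (rather than an LSTM) are illustrative assumptions, not values from the lecture; the point is that the encoder's final hidden state becomes the fixed-length encoding that initializes the decoder.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab=10000, tgt_vocab=10000, emb=256, hidden=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            # Encoder RNN (a GRU here; an LSTM would also work).
            self.encoder = nn.GRU(emb, hidden, batch_first=True)
            # Decoder RNN, initialized with the encoder's final hidden state.
            self.decoder = nn.GRU(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode: read the French words one at a time and keep only the
            # final hidden state as the encoding of the whole sentence.
            _, encoding = self.encoder(self.src_emb(src_ids))
            # Decode: generate the English words one at a time, conditioned
            # on that encoding (teacher forcing during training).
            dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), encoding)
            return self.out(dec_states)  # scores over the target vocabulary

    # Example: one source sentence of 5 tokens, one target sentence of 6 tokens.
    model = Seq2Seq()
    x = torch.randint(0, 10000, (1, 5))   # x<1> ... x<5>
    y = torch.randint(0, 10000, (1, 6))   # y<1> ... y<6>
    logits = model(x, y)                  # shape (1, 6, tgt_vocab)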
One of the most remarkable recent results in deep learning is that this model works: given enough pairs of French and English sentences, if you train the model to input a French sentence and output the corresponding English translation, this will actually work decently well. This model simply uses an encoder network, whose job it is to find an encoding of the input French sentence, and then uses a decoder network to generate the corresponding English translation.

An architecture very similar to this also works for image captioning. Given an image like the one shown here, maybe you want it to be captioned automatically as "a cat sitting on a chair." So how do you train a neural network to input an image and output a caption like that phrase up there? Here's what you can do. From the earlier course on convolutional neural networks, you've seen how you can input an image into a convolutional network, maybe a pre-trained AlexNet, and have that learn an encoding, or a set of features, of the input image. So, this is actually the AlexNet architecture, and if we get rid of the final Softmax unit, the pre-trained AlexNet can give you a 4096-dimensional feature vector with which to represent this picture of a cat. And so this pre-trained network can be the encoder network for the image, and you now have a 4096-dimensional vector that represents the image. You can then take this and feed it to an RNN, whose job it is to generate the caption one word at a time. So, similar to what we saw with machine translation, translating from French to English, you can now input a feature vector describing the input and then have it generate an output sequence, or output set of words, one word at a time. This actually works pretty well for image captioning, especially if the caption you want to generate is not too long.

As far as I know, this type of model was first proposed by Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, although it turns out there were multiple groups coming up with very similar models independently and at about the same time.
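Here is a minimal sketch of that image-captioning variant, again in PyTorch with torchvision. Only the 4096-dimensional AlexNet feature vector comes from the lecture; the vocabulary size, hidden size, and the linear projection used to turn the image features into the decoder's initial state are illustrative assumptions (the pre-trained weights are downloaded by torchvision on first use).

    import torch
    import torch.nn as nn
    from torchvision import models

    # Encoder: pre-trained AlexNet with its final classification layer removed,
    # so it outputs a 4096-dimensional feature vector per image.
    alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
    alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
    alexnet.eval()

    class CaptionDecoder(nn.Module):
        def __init__(self, vocab=8000, emb=256, hidden=512, feat=4096):
            super().__init__()
            self.init_h = nn.Linear(feat, hidden)  # image features -> initial RNN state
            self.emb = nn.Embedding(vocab, emb)
            self.rnn = nn.GRU(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, img_feat, caption_ids):
            h0 = torch.tanh(self.init_h(img_feat)).unsqueeze(0)  # (1, batch, hidden)
            states, _ = self.rnn(self.emb(caption_ids), h0)
            return self.out(states)  # word scores at each step of the caption

    # Example: one 224x224 RGB image and a 6-word caption prefix.
    image = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        features = alexnet(image)        # (1, 4096) feature vector for the image
    decoder = CaptionDecoder()
    logits = decoder(features, torch.randint(0, 8000, (1, 6)))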
Two other groups that did very similar work at about the same time, and I think independently of Mao et al., were Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, as well as Andrej Karpathy and Fei-Fei Li. So, you've now seen how a basic sequence-to-sequence model works, and how a basic image-to-sequence or image captioning model works. But there are some differences between how you would run a model like this, generating a sequence, compared to how you were synthesizing novel text using a language model. One of the key differences is that you don't want a randomly chosen translation; you want the most likely translation. Likewise, you don't want a randomly chosen caption; you want the best, most likely caption. So let's see in the next video how you go about generating that.
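As a small sketch of the difference just described, assuming a decoder that returns a score (logit) for every word in the vocabulary at each step: sampling is what we did to synthesize novel text, while taking the argmax greedily picks the single most likely next word. Choosing the most likely whole sentence, rather than the most likely word at each step, is the problem the upcoming videos on beam search address.

    import torch

    def sample_next(logits):
        # Randomly sample a word according to the model's probabilities,
        # as when generating novel text with a language model.
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    def argmax_next(logits):
        # Greedily pick the most likely word at this step instead.
        return torch.argmax(logits, dim=-1, keepdim=True)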