1 00:00:00,270 --> 00:00:04,351 You have learned a lot about ConvNets, everything ranging from 2 00:00:04,351 --> 00:00:08,888 the architecture of the ConvNet to how to use it for image recognition, 3 00:00:08,888 --> 00:00:13,590 to object detection, to face recognition and neural-style transfer. 4 00:00:13,590 --> 00:00:17,626 And even though most of the discussion has focused on images, 5 00:00:17,626 --> 00:00:21,205 on sort of 2D data, because images are so pervasive, 6 00:00:21,205 --> 00:00:26,135 it turns out that many of the ideas you've learned about also apply, 7 00:00:26,135 --> 00:00:30,640 not just to 2D images but also to 1D data as well as to 3D data. 8 00:00:30,640 --> 00:00:33,048 Let's take a look. 9 00:00:33,048 --> 00:00:38,506 In the first week of this course, you learned about the 2D convolution, 10 00:00:38,506 --> 00:00:44,340 where you might input a 14 x 14 image and convolve that with a 5 x 5 filter. 11 00:00:44,340 --> 00:00:49,097 And you saw how 14 x 14 convolved with 5 x 5 12 00:00:49,097 --> 00:00:52,590 gives you a 10 x 10 output. 13 00:00:52,590 --> 00:00:58,662 And if you have multiple channels, maybe the input is 14 x 14 x 3, 14 00:00:58,662 --> 00:01:03,170 then the filter would be 5 x 5 x 3, so that it matches the same 3. 15 00:01:03,170 --> 00:01:08,460 And then if you have multiple filters, say 16 filters, you end up with 10 x 10 x 16. 16 00:01:08,460 --> 00:01:14,430 It turns out that a similar idea can be applied to 1D data as well. 17 00:01:14,430 --> 00:01:21,328 For example, on the left is an EKG signal, also called an electrocardiogram. 18 00:01:21,328 --> 00:01:25,577 Basically if you place an electrode over your chest, this measures 19 00:01:25,577 --> 00:01:29,910 the little voltages that vary across your chest as your heart beats, 20 00:01:29,910 --> 00:01:34,562 because the little electric waves generated by your heart's beating can be 21 00:01:34,562 --> 00:01:36,823 measured with a pair of electrodes.
22 00:01:36,823 --> 00:01:40,490 And so this is an EKG of someone's heart beating. 23 00:01:40,490 --> 00:01:45,930 And so each of these peaks corresponds to one heartbeat. 24 00:01:45,930 --> 00:01:49,970 So if you want to use EKG signals to make medical diagnoses, for 25 00:01:49,970 --> 00:01:55,062 example, then you would have 1D data, because EKG data 26 00:01:55,062 --> 00:02:01,610 is a time series showing the voltage at each instant in time. 27 00:02:01,610 --> 00:02:04,500 So rather than a 14 x 14 dimensional input, 28 00:02:04,500 --> 00:02:08,160 maybe you just have a 14 dimensional input. 29 00:02:08,160 --> 00:02:11,770 And in that case, you might want to convolve this with a 1 dimensional filter. 30 00:02:11,770 --> 00:02:16,420 So rather than the 5 x 5, you just have a 5 dimensional filter. 31 00:02:16,420 --> 00:02:21,481 So with 2D data, what a convolution allowed you to do was to take the same 5 x 5 32 00:02:21,481 --> 00:02:26,950 feature detector and apply it at different positions throughout the image. 33 00:02:26,950 --> 00:02:31,110 And that's how you wound up with your 10 x 10 output. 34 00:02:31,110 --> 00:02:36,258 What a 1D filter allows you to do is take your 5 dimensional filter and 35 00:02:36,258 --> 00:02:42,860 similarly apply that in lots of different positions throughout this 1D signal. 36 00:02:42,860 --> 00:02:45,510 And so if you apply this convolution, 37 00:02:45,510 --> 00:02:50,270 what you find is that a 14 dimensional thing convolved with 38 00:02:50,270 --> 00:02:55,370 this 5 dimensional thing would give you a 10 dimensional output. 39 00:02:55,370 --> 00:03:00,496 And again, you can have multiple channels, though in this case you 40 00:03:00,496 --> 00:03:06,381 can use just 1 channel, if you have 1 lead or 1 electrode for EKG, so the filter is 5 x 1.
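As a rough illustration of the 1D case just described, here is a minimal sketch in plain Python. The function name `conv1d_valid`, the 14-sample trace, and the filter weights are all made up for this example; they are not from the lecture, and in a real ConvNet the filter weights would be learned. It assumes no padding and a stride of 1.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` with no padding and stride 1.

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    n, f = len(signal), len(kernel)
    if f > n:
        raise ValueError("kernel is longer than the signal")
    return [
        sum(signal[i + j] * kernel[j] for j in range(f))
        for i in range(n - f + 1)
    ]


# Hypothetical 14-sample trace with two sharp peaks standing in for heartbeats
# (peaks at input indices 4 and 11).
signal = [0, 0, 0, 1, 5, 1, 0, 0, 0, 0, 1, 5, 1, 0]
# Hypothetical 5-tap peak detector; assumed weights, for illustration only.
kernel = [-1, 0, 2, 0, -1]

output = conv1d_valid(signal, kernel)
print(len(output))  # 10, that is 14 - 5 + 1
print(output)       # [-5, 1, 10, 1, -5, -1, -1, -5, 1, 10]
```

The same 5-tap detector responds most strongly (value 10) at output positions 2 and 9, which is where it is centered on each of the two peaks. That is the sense in which one feature detector is reused at every position along the signal.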
41 00:03:06,381 --> 00:03:12,468 And if you have 16 filters, you end up with 10 x 16 over there, 42 00:03:12,468 --> 00:03:16,300 and this could be one layer of your ConvNet. 43 00:03:16,300 --> 00:03:20,257 And then for the next layer of your ConvNet, you input a 10 x 16 44 00:03:20,257 --> 00:03:25,560 dimensional input, and you might convolve that with a 5 dimensional filter again. 45 00:03:25,560 --> 00:03:29,583 Then these have 16 channels, so that has to match. 46 00:03:29,583 --> 00:03:34,585 And if we have 32 filters, then the output of this layer 47 00:03:34,585 --> 00:03:39,190 would be 6 x 32, right? 48 00:03:39,190 --> 00:03:42,268 And the analogy to the 2D data, 49 00:03:42,268 --> 00:03:46,779 this is similar to taking the 10 x 10 x 16 data and 50 00:03:46,779 --> 00:03:51,860 convolving it with a 5 x 5 x 16 filter, where the 16 has to match. 51 00:03:51,860 --> 00:03:54,568 That will give you a 6 x 6 dimensional output, 52 00:03:54,568 --> 00:03:58,080 and you have 32 filters, that's where the 32 comes from. 53 00:03:58,080 --> 00:04:03,567 So all of these ideas apply also to 1D data, where you can have the same 54 00:04:03,567 --> 00:04:08,884 feature detector, such as this, applied to a variety of positions. 55 00:04:08,884 --> 00:04:13,430 For example, to detect the different heartbeats in an EKG signal. 56 00:04:13,430 --> 00:04:18,505 You use the same set of features to detect the heartbeats even at different 57 00:04:18,505 --> 00:04:23,836 positions along the time series, and so ConvNets can be used even on 1D data. 58 00:04:23,836 --> 00:04:28,501 For a lot of 1D data applications, you actually use a recurrent neural network, 59 00:04:28,501 --> 00:04:30,790 which you'll learn about in the next course. 60 00:04:30,790 --> 00:04:36,520 But some people also try using ConvNets on these problems.
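To make the channel bookkeeping for the two 1D layers concrete (14 x 1, then 10 x 16, then 6 x 32), here is a small sketch in plain Python. The helper names `conv1d_layer` and `random_filters` are hypothetical, the weights are random stand-ins for learned ones, and bias terms and activations are left out to keep the focus on shapes. It assumes no padding and a stride of 1.

```python
import random


def conv1d_layer(x, filters):
    """Valid, stride-1 1D convolution layer without bias or activation.

    x:       list of `length` time steps, each a list of `channels` values
    filters: list of filters, each a list of `width` taps,
             each tap a list of `channels` weights
    returns: list of `length - width + 1` time steps,
             each with one value per filter
    """
    channels = len(x[0])
    width = len(filters[0])
    for filt in filters:
        if any(len(tap) != channels for tap in filt):
            raise ValueError("filter channels must match input channels")
    out = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        out.append([
            sum(window[t][c] * filt[t][c]
                for t in range(width) for c in range(channels))
            for filt in filters
        ])
    return out


def random_filters(n_filters, width, channels, rng):
    """Random weights standing in for learned ones (illustration only)."""
    return [[[rng.uniform(-1, 1) for _ in range(channels)]
             for _ in range(width)]
            for _ in range(n_filters)]


rng = random.Random(0)
x = [[rng.uniform(-1, 1)] for _ in range(14)]           # 14 x 1: one EKG lead
h1 = conv1d_layer(x, random_filters(16, 5, 1, rng))     # 16 filters of 5 x 1
h2 = conv1d_layer(h1, random_filters(32, 5, 16, rng))   # 32 filters of 5 x 16
print(len(h1), len(h1[0]))  # 10 16
print(len(h2), len(h2[0]))  # 6 32
```

Each second-layer filter spans all 16 channels of the first layer's output, which is why the filter's channel count has to match the input's, and the number of filters sets the number of output channels.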
61 00:04:36,520 --> 00:04:39,990 And in the next course on sequence models, where we will talk about 62 00:04:39,990 --> 00:04:43,310 recurrent neural networks and LSTMs and other models like that, 63 00:04:43,310 --> 00:04:47,545 we'll talk about the pros and cons of using 1D ConvNets versus some of those 64 00:04:47,545 --> 00:04:51,070 other models that are explicitly designed for sequence data. 65 00:04:51,070 --> 00:04:54,290 So that's the generalization from 2D to 1D. 66 00:04:54,290 --> 00:04:56,510 How about 3D data? 67 00:04:56,510 --> 00:04:58,900 Well, what is three dimensional data? 68 00:04:58,900 --> 00:05:04,720 It is that, instead of having a 1D list of numbers or a 2D matrix of numbers, 69 00:05:04,720 --> 00:05:11,060 you now have a 3D block, a three dimensional input volume of numbers. 70 00:05:11,060 --> 00:05:15,123 So here's an example of that, which is if you take a CT scan, 71 00:05:15,123 --> 00:05:20,510 this is a type of X-ray scan that gives a three dimensional model of your body. 72 00:05:20,510 --> 00:05:24,746 What a CT scan does is it takes different slices through your body. 73 00:05:24,746 --> 00:05:28,465 So as you scan through a CT scan, which I'm doing here, 74 00:05:28,465 --> 00:05:33,507 you can look at different slices of the human torso to see how they look, and 75 00:05:33,507 --> 00:05:37,090 so this data is fundamentally three dimensional. 76 00:05:37,090 --> 00:05:43,039 And one way to think of this data is that your data now has some height, 77 00:05:43,039 --> 00:05:46,558 some width, and then also some depth. 78 00:05:46,558 --> 00:05:50,359 Where the different slices through this volume 79 00:05:50,359 --> 00:05:53,840 are the different slices through the torso.
80 00:05:53,840 --> 00:05:57,660 So if you want to apply a ConvNet to detect features in this 81 00:05:57,660 --> 00:06:02,470 three dimensional CAT scan or CT scan, then you can generalize the ideas from 82 00:06:02,470 --> 00:06:07,020 the first slide to three dimensional convolutions as well. 83 00:06:07,020 --> 00:06:10,356 So if you have a 3D volume, and for 84 00:06:10,356 --> 00:06:15,764 the sake of simplicity let's say it is 14 x 14 x 14, and 85 00:06:15,764 --> 00:06:21,770 so this is the height, width, and depth of the input CT scan. 86 00:06:21,770 --> 00:06:25,735 And again, just like images don't all have to be square, 87 00:06:25,735 --> 00:06:29,450 a 3D volume doesn't have to be a perfect cube as well. 88 00:06:29,450 --> 00:06:32,210 So the height and width of an image can be different, and 89 00:06:32,210 --> 00:06:36,118 in the same way the height and width and the depth of a CT scan can be different. 90 00:06:36,118 --> 00:06:40,560 But I'm just using 14 x 14 x 14 here to simplify the discussion. 91 00:06:40,560 --> 00:06:45,849 And if you convolve this with now a 5 x 5 x 5 filter, 92 00:06:45,849 --> 00:06:50,788 so your filters now are also three dimensional, 93 00:06:50,788 --> 00:06:55,863 then this would give you a 10 x 10 x 10 volume. 94 00:06:55,863 --> 00:07:01,366 And technically, you could also write this as x 1, if this is the number of channels. 95 00:07:01,366 --> 00:07:06,715 So this is just a 3D volume, but your data can also have different 96 00:07:06,715 --> 00:07:11,489 numbers of channels, and then the filter would be x 1 as well, 97 00:07:11,489 --> 00:07:17,472 because the number of channels here and the number of channels here have to match. 98 00:07:17,472 --> 00:07:22,371 And then if you have 16 filters that are 5 x 5 x 5 x 1, then the output 99 00:07:22,371 --> 00:07:24,790 will be 10 x 10 x 10 x 16.
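The shape arithmetic for the 3D case can be sketched with a small helper. The name `conv_output_shape` is hypothetical, not from the lecture, and the sketch assumes no padding and a stride of 1, with shapes written as (spatial dimensions..., channels). The non-cubic volume in the second call is an invented example to show that height, width, and depth need not be equal.

```python
def conv_output_shape(input_shape, filter_shape, n_filters):
    """Output shape of a valid, stride-1 convolution in any number of dimensions.

    Both shapes are (spatial dims..., channels); channel counts must agree.
    """
    if len(input_shape) != len(filter_shape):
        raise ValueError("input and filter must have the same number of dimensions")
    *in_spatial, in_channels = input_shape
    *f_spatial, f_channels = filter_shape
    if in_channels != f_channels:
        raise ValueError("filter channels must match input channels")
    if any(f > n for n, f in zip(in_spatial, f_spatial)):
        raise ValueError("filter is larger than the input")
    # Each spatial dimension shrinks by (filter size - 1); the number of
    # filters becomes the number of output channels.
    return tuple(n - f + 1 for n, f in zip(in_spatial, f_spatial)) + (n_filters,)


# The 14 x 14 x 14 x 1 volume with 16 filters of 5 x 5 x 5 x 1.
print(conv_output_shape((14, 14, 14, 1), (5, 5, 5, 1), 16))  # (10, 10, 10, 16)

# Hypothetical non-cubic volume: each dimension is handled independently.
print(conv_output_shape((20, 14, 30, 1), (5, 5, 5, 1), 16))  # (16, 10, 26, 16)
```

Because the helper works on any number of spatial dimensions, the same rule covers the earlier 1D and 2D cases, for example `conv_output_shape((14, 14, 3), (5, 5, 3), 16)` gives `(10, 10, 16)`.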
100 00:07:24,790 --> 00:07:30,129 So this could be one layer of your ConvNet over 3D data, and the next 101 00:07:30,129 --> 00:07:36,660 layer of the ConvNet might convolve this again with a 5 x 5 x 5 x 16 dimensional filter. 102 00:07:36,660 --> 00:07:40,666 So this number of channels has to match, as usual, and 103 00:07:40,666 --> 00:07:46,190 if you have 32 filters, then similar to what you saw with ConvNets on images, 104 00:07:46,190 --> 00:07:54,350 now you'll end up with a 6 x 6 x 6 volume across 32 channels. 105 00:07:54,350 --> 00:07:57,992 So 3D data can also be learned on, 106 00:07:57,992 --> 00:08:02,020 sort of directly, using a three dimensional ConvNet. 107 00:08:02,020 --> 00:08:07,500 And what these filters do is really detect features across your 3D data. 108 00:08:08,730 --> 00:08:13,180 CAT scans, or medical scans, are one example of 3D volumes. 109 00:08:13,180 --> 00:08:18,450 But another example of data you could treat as a 3D volume would be movie data, 110 00:08:18,450 --> 00:08:23,410 where the different slices could be different slices in time through a movie. 111 00:08:23,410 --> 00:08:28,171 And you could use this to detect motion or people taking actions in movies. 112 00:08:28,171 --> 00:08:31,868 So that's it on generalization of ConvNets from 113 00:08:31,868 --> 00:08:35,520 2D data to also 1D as well as 3D data. 114 00:08:35,520 --> 00:08:40,395 Image data is so pervasive that the vast majority of ConvNets are on 2D data, 115 00:08:40,395 --> 00:08:45,420 on image data, but I hope that these other models will be helpful to you as well. 116 00:08:45,420 --> 00:08:48,588 So this is it, this is the last video of this week and 117 00:08:48,588 --> 00:08:51,570 the last video of this course on ConvNets. 118 00:08:51,570 --> 00:08:53,810 You've learned a lot about ConvNets and 119 00:08:53,810 --> 00:08:58,380 I hope you find many of these ideas useful for your future work.
120 00:08:58,380 --> 00:09:01,600 So congratulations on finishing these videos. 121 00:09:01,600 --> 00:09:04,150 I hope you enjoyed this week's exercise and 122 00:09:04,150 --> 00:09:07,850 I look forward also to seeing you in the next course on sequence models.