Another useful way to introduce hints is through the way you train the model. Say you have an image classification problem, but this time it's nearly impossible: you only have 5,000 labeled images, and you want to take a photo of a person and predict the style of their clothing, say smart, formal, whatever. In this case it's really hard to obtain labeled images, because you would have to label them all by hand or hire people to do it for you.

The problem is that if you fit a neural network here, you have two choices. Either you use a normal-sized network, but since there is so little data, it will probably overfit before it ever fits anything. Or you use a heavily regularized network, something like one layer with a couple of neurons, but a network that small won't be able to learn anything better than a linear model.

Fortunately, deep learning allows you to kill both birds with one stone. You can build a large network that nevertheless does not overfit too much. We do so by introducing other problems that it should also solve. Let's say this clothing style problem is too hard to obtain data for, but we have a different problem: for the same kind of data, we have age and gender labels. Those are much easier to get, since you can probably extract a lot of such information from social networks; you can just parse everything you've got. It might be slightly illegal, but let's assume you work at the social network company and you own this data.

In this case, you want to train your network to predict both the style and the age and gender features, and you do it the following way. You feed the portrait photo through the first layer, then the second one, and then there is a split, after which each head of your network is used to predict a different set of labels.
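To make the split concrete, here is a minimal PyTorch sketch of such a two-headed network. Everything in it is illustrative: the class name TwoHeadNet, the layer sizes, and the use of dense layers on already-flattened inputs instead of the convolutional layers a real vision model would use.

```python
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with task-specific heads; all sizes are illustrative."""
    def __init__(self, n_features=1024, n_styles=5, n_age_groups=8):
        super().__init__()
        # Shared layers: both tasks shape the features learned here.
        self.trunk = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Head for the small, hard-to-label task: clothing style.
        self.style_head = nn.Linear(128, n_styles)
        # Heads for the large, easy-to-label tasks: age group and gender.
        self.age_head = nn.Linear(128, n_age_groups)
        self.gender_head = nn.Linear(128, 2)

    def forward(self, x):
        h = self.trunk(x)  # shared representation used by every head
        return self.style_head(h), self.age_head(h), self.gender_head(h)
```

The important part is that every head reads from the same trunk, so gradients from either task update the shared features.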
This architecture expresses a very powerful idea. You say that you want your first two dense layers to learn features that are not only useful for predicting the style of clothes, but also useful for determining a person's age or gender. This is very useful because the second domain contains many more images, so it is hard to overfit on it. In this case, you will learn features that are useful for both worlds, and they won't be able to overfit your small problem before they ever fit it.

And there are features that suit this description perfectly. For example, if you are working with raw image pixels, it makes sense that before trying to determine style, age, or gender, it is important to at least find where the person's head is. For the first problem, determining clothing style, this helps because you will then be able to tell whether the person wears some kind of jewelry, which is indicative of the style. For the second problem, knowing where the face is helps you determine whether the person has facial hair, for example, which is a telltale sign of a man, as far as I remember. This means you will be able to train a lot of low-level features essentially for free, with almost no overfitting, as long as your second domain is large enough in terms of available data.

When training such a network, you have to feed it mini-batches from both problems in an alternating fashion. For example, you could first sample a mini-batch of images for which you know age and gender and train the network through its second head, then sample a second mini-batch containing images for which you know the style and train the first head, as sketched in the code below. There is, of course, more than one way to organize the training procedure. For example, you can take several mini-batches from the second domain for every one from the first, or you can first train your network to convergence on age and gender prediction and only then start fine-tuning it on the first problem. We will study this idea to a much greater extent in the following module dedicated to computer vision.

The general idea is that, regardless of which method you use, you still get this neat property: you can train a huge neural network to learn reasonable features even though the original data is really scarce.
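Here is one possible way to write the alternating loop just described, again only a sketch: train_epoch, style_loader, and demo_loader are made-up names, and nothing forces the one-to-one ratio between the two domains.

```python
import torch.nn.functional as F

def train_epoch(model, optimizer, style_loader, demo_loader):
    """One epoch of alternating updates over the two labeled domains."""
    for (x_style, y_style), (x_demo, y_age, y_gender) in zip(style_loader, demo_loader):
        # Mini-batch from the large age/gender domain: update through the second head.
        optimizer.zero_grad()
        _, age_logits, gender_logits = model(x_demo)
        demo_loss = F.cross_entropy(age_logits, y_age) + F.cross_entropy(gender_logits, y_gender)
        demo_loss.backward()
        optimizer.step()

        # Mini-batch from the small style domain: update through the first head.
        optimizer.zero_grad()
        style_logits, _, _ = model(x_style)
        style_loss = F.cross_entropy(style_logits, y_style)
        style_loss.backward()
        optimizer.step()
```

Both updates flow through the shared trunk, which is exactly why the plentiful age and gender data regularizes the features used for the scarce style labels.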
So, we have just seen a few of these words of power in deep learning. The first one was about managing the level of abstraction: I want features that are much more abstract than the raw image pixels, so that I can use them together with other features. The second one was about managing how much you rely on different features, so you can say: don't trust those guys too much, trust these guys instead. The final one was: could you please train features that are not only useful for my super small problem, but also generally useful for similar problems?

There is, of course, much more to deep learning, and here are some examples of other ideas you can incorporate into your neural network architecture that will appear later in our course. For example, if you want to solve an image classification problem where you want to tell cats from dogs, it makes a lot of sense to make the features you learn invariant to the position of the cat or dog. A cat may appear in the middle, in the top right, or in the bottom left corner, and you want your model to detect it regardless of where it is. There is also a way to teach your neural network to be robust, resilient to small shifts in the data: if a cat slightly moves its paw, it doesn't stop being a cat.

For natural language applications, you'll learn how to teach your neural network to find the underlying cause of the data. Say you have words and you want to classify the sentiment. Instead of working on the level of words, say with a bag-of-words representation, you'll teach your neural network to find the hidden structure, the hidden process that generated those words, basically reverse-engineering the human mind.

There is also a way to train your neural network so that the representations in its intermediate layers have some particular property. For example, you may want your network to be robust in the sense that it doesn't trust any single feature too much. Or you can push the hidden representation to be sparse, so you train your network in such a way that almost all neurons output zeros for any given data object. Of course, there's much more to it.
I have just barely scratched the surface of this idea of deep learning being a language, and as we go further, you'll study much more powerful tools to play with. Now, the key difference between deep learning and other machine learning methods, in my humble opinion, is that while in a random forest you only have a few parameters to tweak, deep learning actually allows you to build networks, to build architectures, in a way that resembles a natural or a programming language. Of course, this language is, as of now, really hard to master. It's hard to tell what kind of architecture or what kind of trick fits a particular problem. And as in any other language, there are a lot of exceptions, so you can't just write down a set of rules and follow them everywhere. Hopefully, our course will help you obtain some of this intuition, though the main source of it is the coding labs, not just listening to lectures. And of course, you'll get much more proficient and resourceful if you actually solve the problems on your own, and in this I wish you luck.