[MUSIC] Welcome back. We just studied how deep learning works and how to train your neural networks, and with some luck you've already made it through the practical assignments. So you basically know that deep learning can help you when one of your models just doesn't cut it. And you probably hope that this will repeat itself on other problems. To a large extent this is true, but today let's talk about where it isn't, and about what deep learning is not. You already know some of these things; now let's talk about some of the limitations.

The one thing deep learning is not is magic. It won't just solve all the problems for you. It won't be a silver bullet that you can just unpack and hope that it does much better than anything you have tried previously for years. This is what a lot of people expect from neural networks, but please don't, because it won't solve your problems for free.

Instead, deep learning is just a practical field. It has its strengths, and we'll talk about them in the second part. But it also has its weak points. For one thing, deep learning lacks a core theoretical understanding. That sounds like a lame accusation to make about a practical field, since the absence of a theory obviously isn't preventing it from working. The problem is that when you try to build an architecture, to develop something new for a model, the absence of a theoretical core that could explain things for you forces you into a lot more experimentation. In effect, deep learning only offers you intuitions: "this works", "this idea kind of applies wherever you have this situation", and so on. But those intuitive rules of thumb are not 100% accurate, and that is a problem if you want to develop something new.

Another problem is that, in order to capture complex dependencies, neural networks and deep learning models in general have a lot of parameters. This not only means that they can capture complex dependencies in the data, but also that they can overfit tremendously. So for any problem, if you use a neural network, you generally need a much larger dataset to train on than you would with linear models or decision trees.
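To get a feel for the scale involved, here is a back-of-the-envelope parameter count in Python; the image size and layer widths below are made-up, illustrative numbers, not anything from this course.

```python
# Rough parameter counts: a linear model vs. a small fully-connected network,
# both taking a flattened 100x100 grayscale image as input (illustrative sizes).

n_inputs = 100 * 100          # flattened image pixels
hidden_sizes = [256, 128]     # assumed hidden layer widths
n_outputs = 1                 # a single predicted value

# Linear model: one weight per input plus a bias.
linear_params = n_inputs * n_outputs + n_outputs

# Fully-connected network: weights plus biases for every layer.
mlp_params = 0
prev = n_inputs
for width in hidden_sizes + [n_outputs]:
    mlp_params += prev * width + width
    prev = width

print(f"linear model: {linear_params:,} parameters")  # 10,001
print(f"small MLP:    {mlp_params:,} parameters")     # 2,593,281
```

Hundreds of times more parameters than a linear model is exactly why such a network can both fit richer patterns and overfit a small dataset.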
Whenever you end up in some new area which is not image classification or text processing, sometimes you'll find out that for practical reasons it's better to use decision trees or even linear models.

And finally, deep learning models are computationally heavy. Whenever you want your machine learning to run super fast or to require as little memory as possible, say if you're running on smartphones or embedded systems, you'll generally have to do some, again, dark magic to make your neural network run as fast as you require. This isn't true for, say, linear models, which apply almost instantaneously.

There's one more disadvantage, and it's a hard one to fix, although it has some strong points: deep learning is pathologically overhyped. Basically machine learning, the super-domain of deep learning, is overhyped as well, but deep learning is the most advertised, the hottest topic within the hottest area of mathematics, which is machine learning. This is good, because deep learning attracts a lot of talented researchers and talented practitioners. The problem is that, since it's so hyped, a lot of people expect wonders from it. So sometimes, if you're trying to apply deep learning in business, you'll find yourself in the company of people who don't understand deep learning. They believe it's some super artificial intelligence, big data, blah blah, yada yada, that will get you to the top position in your business and solve all your problems for you. So not only should you not expect deep learning to work wonders by itself, because wonders, as you know, require a lot of hard work, you also have to fight with other people who expect otherwise.

Now, all those arguments paint a rather grim picture of what deep learning is, but there are a lot of positive sides to it as well. For one, you can think of deep learning as a kind of machine learning language. Like any language, it is a tool to express something. A natural language is a tool with which you can express yourself, at least to other humans, and a programming language is a means to express what you want your computer to do in a way that the computer can execute.
Deep learning, in turn, is a language that allows you to hint to your machine learning model what you want it to learn: hints about what kind of features you want it to have, and what kind of expert knowledge can be applied to this dataset.

Let's go through a few examples to prove this point. Say you have a usual prediction problem with two sets of features: raw low-level features and high-level features, and you want to predict some kind of target given them. This whole thing is beginning to sound a little abstract, so let's get to a concrete scenario. Say you want to run a regression on the price of a car, a second-hand car to be accurate. You have, well, a photo of the car, and some high-level features like the brand, the model, maybe the production date, and any blemishes and enhancements installed on this car.

What you want to do is build a model that uses both of those feature types, and the simplest way to do so is to just concatenate them and feed the whole thing into your model, a neural network for example. Of course you can do that, but the problem is that this approach is kind of inefficient. If we speak about neural networks, the resulting model would look like this, for example. The main problem with this model is that the first dense layer tries to combine two worlds, two domains of features, and it tries to combine them linearly. So what it does is take the age, measured in years or months, multiply it by some coefficient, and add it up with a pixel intensity. That's technically possible, I mean no one will punish you for doing so unless there's a physicist nearby, but it's kind of unnatural, and in practical applications this architecture tends to work worse than it otherwise could.

What you can do instead is say the following thing in this language: you want to build a representation of those raw features which is as complex as the high-level ones. The way you express this is by, well, adding more layers. Basically you now have two branches of data, and for some amount of time you process them independently, roughly as in the sketch below.
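Here is a minimal sketch of that two-branch idea in Keras, just to make it concrete; the layer widths and activations are assumptions for illustration, not the exact architecture from the lecture.

```python
# Two-branch model: the raw pixels get their own stack of dense layers before
# they are allowed to mix with the hand-crafted car attributes.
from tensorflow import keras
from tensorflow.keras import layers

image_pixels = keras.Input(shape=(100 * 100,), name="raw_pixels")       # flattened photo
car_attributes = keras.Input(shape=(100,), name="high_level_features")  # brand, age, ...

# Process the raw pixels independently for a few layers first,
# so they become higher-level features before mixing.
x = layers.Dense(256, activation="relu")(image_pixels)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)

# Only now combine the learned image features with the high-level attributes.
combined = layers.Concatenate()([x, car_attributes])
combined = layers.Dense(64, activation="relu")(combined)
price = layers.Dense(1, name="price")(combined)

model = keras.Model(inputs=[image_pixels, car_attributes], outputs=price)
model.compile(optimizer="adam", loss="mse")
```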
You take those raw features and apply dense layers, maybe two or three stacked dense layers, that only extract features from the raw image pixels. And only then, once you've got features like the presence of a blemish, or maybe a crack in the windshield, or anything like that, only then do you combine those features with the high-level features you've got. This makes slightly more sense, although it's still not the perfect model. Generally, stacking more layers to extract features also gives you more abstract kinds of features, and if you stack enough layers, you'll eventually get features that are easy to combine.

Let's now consider a similar, although slightly different, problem. This time we're still solving car price regression, but we also want to infuse another piece of prior knowledge. Say that, based on some external information we've got, we don't want our network to trust the image data too enthusiastically. For example, I might be unwilling to trust the car dealers that much: some of their images have been shown to be too optimistic, showing the car in a better condition than the actual one. By default, our network does the exact opposite: it trusts the images too much, because there are, say, 10,000 image pixels, 100 by 100 pixels, and only, say, 100 attributes among the high-level features. So we want to do the opposite.

You can of course achieve this by means of usual machine learning, simply by regularizing the raw features, the pixels, more heavily. But in deep learning you can also do this by means of the architecture. In this case we have introduced a thing called a bottleneck layer: this one layer with 32 units, which is much smaller than any other layer. And since it's a bottleneck, any information the neural network takes from the image has to go through this layer. It kind of limits the amount of useful features your model can get from the image, and biases it toward trusting the raw image features less. This is of course not guaranteed: technically, if you fit your model for too long, it might just encode everything into some super-complex non-linear dependency and still get all the information through.
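Continuing the earlier sketch, the bottleneck version only changes the image branch; the 32 units follow the lecture, while every other size is still an illustrative assumption.

```python
# Same two-branch model, but the image branch is squeezed through a narrow
# bottleneck layer before it may mix with the high-level attributes.
from tensorflow import keras
from tensorflow.keras import layers

image_pixels = keras.Input(shape=(100 * 100,), name="raw_pixels")
car_attributes = keras.Input(shape=(100,), name="high_level_features")

x = layers.Dense(256, activation="relu")(image_pixels)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(32, activation="relu", name="bottleneck")(x)  # all image information must pass through 32 units

combined = layers.Concatenate()([x, car_attributes])
combined = layers.Dense(64, activation="relu")(combined)
price = layers.Dense(1, name="price")(combined)

bottleneck_model = keras.Model(inputs=[image_pixels, car_attributes], outputs=price)
bottleneck_model.compile(optimizer="adam", loss="mse")
```

The squeeze doesn't forbid the network from using the image; it just makes it relatively harder to push a lot of image information into the prediction.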
Still, it's one way you can approach this problem. [MUSIC]