Another useful way to introduce hints is through the way you train the model. Say you have an image classification problem, but this time it's nearly impossible: you only have 5,000 labeled images, and you want to take a photo of a person and predict the style of their clothing, say smart, formal, and so on. Labeled images are really hard to obtain here, because you would have to label them all by hand or hire people to do it for you.

The problem is that if you fit a neural network, you have two choices. Either you use a normal-sized network, but since there is so little data, it will probably overfit before it ever fits anything. Or you use a heavily over-regularized network, say one with a single layer and two neurons, but then it won't be able to learn anything better than a linear model; it's that small, remember.

Fortunately, deep learning allows you to kill both birds with one stone. You can build a large network that nevertheless won't overfit too much, by introducing other problems that it should also solve. Say this clothing style problem is too hard to obtain data for, but we have a different problem: for the same kind of data, we have age and gender labels. This is much simpler, since you can probably extract a lot of such information from social networks; you can just parse everything you've got. It might be slightly illegal, but let's assume you work at the social network company and you own this data.

In this case, you want to train your network to predict both style and those age and gender features, and you want to do it the following way. You feed the photo through the first layer, then the second one, and then there is a split, after which each head of your network predicts a different set of targets.

This architecture expresses a very powerful idea. You are saying that you want your first two dense layers to learn features that are not only useful for predicting the style of clothes, but also useful for determining a person's age or gender. This is very useful because the second domain contains many more images, so it's hard to overfit on it. You'll learn features that are useful for both worlds, and they won't be able to overfit your small problem before they ever fit it.

There are in fact features that suit this description perfectly. For example, if you are working with raw image pixels, it makes sense that before trying to determine style, age, or gender, it is important to find where the person's head is, at least. For the first problem, determining clothing style, this makes sense because you'll then be able to tell whether the person wears some kind of jewelry, which is indicative of style. For the second problem, knowing where the face is helps you determine whether the person has facial hair, which is a telltale sign of a man, as far as I remember. This means you'll be able to train a lot of low-level features essentially for free, with almost no overfitting, as long as your second domain is large enough in terms of available data.

When training such a network, you'll have to feed it mini-batches from both problems in an alternating fashion, as in the sketch below.
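To make this concrete, here is a minimal sketch of such a two-headed network and its alternating training loop, written with Keras. Treat it as an illustration under stated assumptions, not a prescribed implementation: the layer sizes, the 64x64 photo shape, the four style classes, and the demo_batches / style_batches generators are all hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Shared trunk: the first two dense layers must learn features
# useful for BOTH problems (style, and age/gender).
inputs = keras.Input(shape=(64, 64, 3))           # hypothetical photo size
x = layers.Flatten()(inputs)
x = layers.Dense(256, activation="relu")(x)       # shared layer 1
shared = layers.Dense(128, activation="relu")(x)  # shared layer 2

# Head 1: clothing style, the small 5,000-image problem.
style = layers.Dense(4, activation="softmax", name="style")(shared)

# Head 2: age and gender, the large, easy-to-label problem.
age = layers.Dense(1, name="age")(shared)
gender = layers.Dense(1, activation="sigmoid", name="gender")(shared)

# Two models built on the SAME trunk layers, so training either one
# also updates the shared weights.
style_model = keras.Model(inputs, style)
demo_model = keras.Model(inputs, [age, gender])
style_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
demo_model.compile(optimizer="adam", loss=["mse", "binary_crossentropy"])

for step in range(1000):
    # One mini-batch through the age/gender head first...
    x_d, age_d, gender_d = next(demo_batches)     # hypothetical batch source
    demo_model.train_on_batch(x_d, [age_d, gender_d])
    # ...then one through the style head; the shared trunk learns from both.
    x_s, style_s = next(style_batches)            # hypothetical batch source
    style_model.train_on_batch(x_s, style_s)
```

Because both models reference the same layer objects, the trunk is pulled toward features that serve both tasks, which is exactly the regularizing effect described above.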
For example, you could first sample a mini-batch of images for which you know age and gender and train the network through its second head, then sample a second mini-batch containing images for which you know the style and train the first head. There is, of course, more than one way to organize the training procedure. For example, you could take two mini-batches of data from the second domain for every one from the first, or you could first train your network to convergence on age and gender prediction only, and then start fitting it to the first problem. We'll study this idea to a much greater extent in the following module dedicated to computer vision. The general point is that regardless of which method you use, you still get this neat property: you can train your huge neural network to learn reasonable features even though the original data is really scarce.

So, we've just seen a few of those words of power in deep learning. The first one was about managing the level of abstraction: "I want features that are much more abstract than raw image pixels, so that I can use them together with other features." The second one was about managing the amount of features, so you can say, "don't trust those guys too much, trust these guys instead." The final one was, "could you please train features that are not only useful for my very small problem, but also generally useful for similar problems?"

There is, of course, much more to deep learning, and here are some examples of other ideas you can incorporate into your neural network architecture that will appear later in our course. For example, if you want to solve an image classification problem, say telling cats from dogs, it makes a lot of sense to make the features you learn invariant to the position of the cat or dog. A cat may appear in the middle, in the top right, or in the bottom left corner, and you want your model to detect it regardless of where it is. There is also a way to teach your neural network to be robust, resilient to small shifts in the data: if a cat slightly moves its paw, it doesn't stop being a cat.

For natural language applications, you'll learn how to teach your neural network to find the underlying cause of the data. Say you have words and you want to classify the sentiment. Instead of working at the level of words, say with a bag-of-words representation, you'll teach your neural network to find the hidden structure, the hidden process that generated those words, basically reverse-engineering the human mind.

There is also a way to train your neural network so that the representations in its intermediate layers have some particular property. For example, you may want your network to be robust in the sense that it doesn't trust any single feature too much. Or you can adjust the hidden representation to be sparse, training your network so that almost all neurons output zeros for any given data object; a short sketch of both tricks follows below.

Of course, there's much more to it. I've just barely scratched the surface of this idea of deep learning being a language, and as we go further, you'll study much more powerful tools to play with. Now, the key difference between deep learning and other machine learning methods, in my humble opinion, is that in a random forest you would have just a few parameters to tweak, while deep learning actually allows you to build networks, to build architectures, in a way that resembles a natural or a programming language.
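As a small taste of those representation-shaping tricks, here is a hedged Keras sketch, with arbitrary illustrative sizes and penalty strengths: Dropout randomly zeroes activations during training, so the network can't rely on any single feature, while an L1 activity penalty pushes most hidden activations toward zero, yielding a sparse representation.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(784,)),        # hypothetical input size
    layers.Dense(256, activation="relu"),
    # Randomly drop half the features so none can be trusted too much.
    layers.Dropout(0.5),
    # Penalize activation magnitudes so most neurons output (near) zeros.
    layers.Dense(128, activation="relu",
                 activity_regularizer=regularizers.l1(1e-4)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```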
Now, of course, this language is, as of now, really hard to master. It's hard to tell what kind of architecture or what kind of trick fits a particular problem. And as in any other language, there are a lot of exceptions, and you can't just write down a set of rules and follow them everywhere. Hopefully, our course will help you obtain some of this intuition, though the main source of it is the coding labs, not just listening to lectures. And of course, you'll get much more proficient and resourceful if you actually solve the problems on your own, and in this, I wish you luck.