Many of the functions that we'd most like to differentiate are actually compositions of two different functions. This happens in the real world, too. I mean, look, if you change the number of flowers, that's going to affect, say, how many rabbits there are around to, you know, eat those flowers. And if you change the number of rabbits, that'll affect how many wolves that forest can support.
Here's a concrete example. Suppose that f of x is the number of widgets produced with an investment of x dollars, right? With more money, maybe, you can build more widgets. Suppose g of n is the income that you get by selling those n widgets. What you're probably really interested in is not exactly how many widgets you produce. What you'd like to know is, for a given investment, how much money are you going to make, right? Well, that's g of f of x minus your initial investment, right? g of f of x is how much money comes in when you sell the widgets that you produced with your initial investment of x dollars. So this quantity is measuring the profit on an investment of x dollars in widget production.
We need some framework, some general picture that lets us understand how one thing changing affects something else, and how that thing's changing goes on to affect something else. Specifically, if I've got some function h which is a composition of two functions, g of f of x in this case, I'd like to know something about the derivative of h. I want to know how changing x affects f, and then how changing f goes on to affect g. And I'd like some sort of formula that gives me that answer, right? I'd like to know the derivative of h in terms of information about how changing x affects f and how changing the input to g affects g. I want a formula for the derivative of h in terms of the derivative of f and the derivative of g. This is exactly what the chain rule does.
What the chain rule says is that the derivative of the composition is the derivative of g evaluated at f of x, times the derivative of f evaluated at x. Sometimes people think this formula looks a little bit weird, you know? I'm composing functions, but now it's the derivative of g composed with just the function f. What's going on? Given the fact that the derivative of a sum is the sum of the derivatives, you might be tempted to think that the derivative of a composition should be the composition of the derivatives, but that's not the case. The chain rule really is capturing what happens when you chain together these changes.
So let's think about this chain rule, the derivative of g of f of x is g prime of f of x times f prime of x, in terms of chaining together different changes. What I'm trying to calculate is how changing x changes g of f of x, right? This is the derivative of the composition. What do I know? Well, I know how changing x will change f of x, right? This is what the derivative of f is measuring. The derivative is the ratio of output change to input change. Now, in between here, what I have is that the change in f of x will change g of f of x in some way. This ratio of changes is really the derivative of g at the point f of x. What is the derivative? You plug an input into the derivative to ask how wiggling that input would affect the output, and that's exactly what this ratio is. I'm asking how f of x changing will affect g of f of x, right? That's the derivative of g at the point that's wiggling, f of x. Now, if I just multiply these two things together, the change in g of f of x over the change in f of x, times the change in f of x over the change in x, then the changes in f of x cancel and I get the change in g of f of x divided by the change in x. This is the chain rule, right?
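Written out in symbols, the argument above reads roughly as follows. The Δ notation for a small change is introduced here only for illustration and does not appear in the lecture, and the cancellation is a heuristic for small changes, not a proof.

```latex
\[
\frac{\Delta\, g(f(x))}{\Delta x}
  \;=\;
  \underbrace{\frac{\Delta\, g(f(x))}{\Delta\, f(x)}}_{\approx\; g'(f(x))}
  \cdot
  \underbrace{\frac{\Delta\, f(x)}{\Delta x}}_{\approx\; f'(x)}
\qquad\Longrightarrow\qquad
\frac{d}{dx}\, g(f(x)) \;=\; g'(f(x)) \cdot f'(x).
\]
```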
If I multiply together g prime of f of x and f prime of x, what I'm left with is exactly what I want, the derivative of g of f of x. You can see this pictorially as well. So here, I've drawn three number lines. On the first number line, I've drawn x, and I imagine x is the input to f. On the second number line, I've drawn f of x, and f of x is now the input to g. And on the last number line, I've drawn g of f of x. The essential question answered by the derivative is how changing x will affect g of f of x. But since this is a composition of functions, I'm going to analyze the effect of changing x on g of f of x in stages, right? I'm first going to see how changing x affects f of x, and then how f of x changing affects g of f of x.
So let's imagine that I change x by a small quantity. I'm calling that small quantity h here. This h is not a function, just some small number, the amount by which I'm wiggling the input. Now, how is the output affected? Well, that's exactly what the derivative measures, right? The derivative of f at x tells me how wiggling the input x would affect the output. So f prime of x, which is the ratio of output change to input change, times an actual input change gives me a first-order approximation of the output change. So I imagine the output is changing by about f prime of x times h.
Now, how does that change in the value of f of x affect g? Well, I have to figure out how wiggling the input to g will affect the output of g, and that depends on where I'm calculating the derivative. I need to calculate the derivative of g at the point f of x, because f of x is the point that's doing the wiggling. So it's the derivative of g at the point f of x that tells me how wiggling the input around f of x would affect the output of g. So the output of g changes by about that derivative times the amount by which the input changed, which is this quantity here, f prime of x times h.
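In symbols, the two-stage picture can be sketched as a pair of first-order approximations, with h the small number by which x is wiggled. The approximation signs are an assumption of this sketch: the error terms are simply dropped, which is reasonable only for small h.

```latex
\begin{align*}
f(x + h) &\approx f(x) + f'(x)\,h
  && \text{(first stage: wiggle the input to } f\text{)} \\
g\bigl(f(x + h)\bigr) &\approx g\bigl(f(x) + f'(x)\,h\bigr) \\
  &\approx g\bigl(f(x)\bigr) + g'\bigl(f(x)\bigr)\cdot f'(x)\,h
  && \text{(second stage: wiggle the input to } g \text{ around } f(x)\text{)}
\end{align*}
```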
And when you look at it this way, you can see that for an input change to x of some small amount h, the output changes by about g prime of f of x times f prime of x times as much, which is exactly what the chain rule is telling me should be the case. So this is the correct rule: the chain rule really is the derivative of the outside function at the inside, times the derivative of the inside function.
Let's try to see a numerical example of this thing in action. So as a numerical example, let's consider the function g of x equals x to the 4th power and the function f of x equals 1 plus x to the 3rd power. And maybe what I want to try to estimate is g of f of 1.0001. Approximately what is that equal to? Well, it's not too hard to calculate g of f of 1, right? What's f of 1? That's 1 plus 1 cubed, that's 2. So what's g of 2? That's 2 to the 4th, that's 16. So I know that g of f of 1.0001 is going to be close to 16. The question is, how is wiggling the input up to 1.0001 going to affect the output of this composition of functions?
Well, I could do it in stages, right? That's what the chain rule's telling me to do. So I could calculate first the derivative of f at 1, right? The derivative of f is 3x squared, so the derivative of f at 1 is 3. And indeed, if I calculate f of 1.0001, that's about 2.0003 and a bit more. Now, I want to try to calculate how changing the input to g will affect the output of g. So I should calculate the derivative of g, and that's 4x cubed by the power rule. But where should I evaluate the derivative of g? Your first temptation is to calculate the derivative of g at 1, but that is not a good idea, because you're not wiggling the input 1 to g. What you really should be calculating is the derivative of g at 2, because it's this 2 that's going to be wiggling. When you wiggle the input to f, it's the output of f, f of 1, that's going to be changing, so you should calculate the derivative of g there. And what is that?
That's 4 times 2 cubed, that's 4 times 8, that's 32. So what we're trying to calculate is g of f of 1.0001, and we know that that's about g of, well, what's f of 1.0001? It's about 2.0003. So what happens when I wiggle the input of g from 2 to 2.0003? Well, that should be about the output of g at 2, which is 16, plus how much I changed the input by, times the derivative of g at the point where the wiggling is happening, which is 2, and that derivative is 32. And what's 16 plus 0.0003 times 32? That's 16.0096. So g of f of 1.0001 is about 16.0096. And you can see this 96 just from the chain rule, right? The relevant thing to calculate is g prime of f of 1 times f prime of 1. This is going to tell me how wiggling the input 1 affects the output. g prime of f of 1 is 32, f prime of 1 is 3, and 32 times 3 is 96.
So, that's the chain rule, and it's going to take some time for the chain rule to really sink in. But the chain rule is super important for two very different reasons. On the one hand, you've gotta know the chain rule just to be able to compute derivatives. A lot of the functions that you'll be asked to differentiate are actually compositions of differentiable functions, so you'll need to use the chain rule to finish those derivative calculations. But on the other hand, you've gotta know the chain rule just to understand how chained-together changes work. In the real world, a lot of things change, and those changing things affect other things, and those changing things then go on to affect yet other things. And you've got to understand how those changes get composed together in order to really understand how the real world works.
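The numerical example above, with g of x equal to x to the 4th and f of x equal to 1 plus x cubed, can be checked with a few lines of Python. This is only a sketch: the names `f_prime`, `g_prime`, `slope`, `approx`, and `exact` are chosen here for illustration, and the derivatives are typed in by hand from the power rule, not computed automatically.

```python
# Functions from the lecture's numerical example.
def f(x):
    return 1 + x**3

def g(x):
    return x**4

# Derivatives written out by hand using the power rule.
def f_prime(x):
    return 3 * x**2

def g_prime(x):
    return 4 * x**3

x0 = 1.0      # the input being wiggled
dx = 0.0001   # the size of the wiggle

# Chain rule: derivative of g at f(x0), times derivative of f at x0.
slope = g_prime(f(x0)) * f_prime(x0)

# First-order estimate of g(f(x0 + dx)) versus the directly computed value.
approx = g(f(x0)) + slope * dx
exact = g(f(x0 + dx))

print("chain rule slope:", slope)
print("first-order estimate:", approx)
print("direct evaluation:", exact)
```

The estimate and the direct evaluation should agree to several decimal places, with the small leftover gap coming from the higher-order terms that the first-order approximation ignores.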