So, now that we've talked about these different memory technologies, let's move on to an introduction to caches: why we have caches, and where the idea really comes from, from a first-principles perspective. To start motivating caches, let's draw the most basic memory system we can draw. In our abstract computer model, we have a processor, and it connects to somewhere that stores our data and our instructions: our main memory. The time it takes to access something like main memory can often be much, much larger than our cycle time; the latency can be much longer. So, for that DRAM I passed around: getting a byte out of it, or getting the first byte out of a modern DRAM, costs you roughly a thousand instructions' worth of time on a modern computer. It's actually gotten a little better, as we'll see on the next graph, for a funny reason.

We're going to define bandwidth as the number of accesses per unit time. As I said, the close-in arrays you can usually access much faster and with more bandwidth, almost by definition. Things far away give you less bandwidth, also almost by definition, because the data probably has to go through pins or buses that travel a long way. And we'll define the bandwidth-delay product as the amount of data that can be in flight at the same time. As things get farther away and bandwidth goes up, the bandwidth-delay product gets bigger, because you're multiplying the bandwidth you have by the time everything spends in the pipe. You basically have a big pipeline that can hold thousands of values in flight.

This is what I was alluding to about the gap between DRAM bandwidth and processor performance. Memory technology has been slowly getting faster, and processor technology got a lot faster. The one saving grace is that single-processor, single-thread performance has flattened out at this point, so DRAM is at least catching up there. Unfortunately, if you look at multicore, the curve keeps going exponentially: with multiple cores on the same chip, aggregate processor performance keeps rising relative to memory performance, and things get worse. So here, a four-issue, two-gigahertz superscalar processor accessing 100-nanosecond DRAM can execute 800 instructions in the time it takes to service that one memory reference. We have things faster than two gigahertz now, and we also have slower memories, so depending on the memory system, some of them come out to roughly a thousand instructions.
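To make those numbers concrete, here is a small back-of-the-envelope calculation in C. The clock rate, issue width, and DRAM latency are the ones from the example above; the 10 GB/s memory bandwidth figure is just an assumed round number to show how a bandwidth-delay product comes out.

    #include <stdio.h>

    int main(void) {
        /* Numbers from the example above, plus one assumed bandwidth figure. */
        double clock_hz       = 2e9;      /* 2 GHz processor                   */
        double issue_width    = 4;        /* 4 instructions issued per cycle   */
        double dram_latency_s = 100e-9;   /* 100 ns to access DRAM             */
        double bandwidth_Bps  = 10e9;     /* assumed ~10 GB/s memory bandwidth */

        /* Instructions the core could have executed while waiting on one access. */
        double lost_instructions = clock_hz * issue_width * dram_latency_s;

        /* Bandwidth-delay product: bytes that can be in flight at once. */
        double in_flight_bytes = bandwidth_Bps * dram_latency_s;

        printf("instructions per DRAM access: %.0f\n", lost_instructions); /* 800  */
        printf("bytes in flight:              %.0f\n", in_flight_bytes);   /* 1000 */
        return 0;
    }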
Okay. So, what's the important first-order effect when we're looking at memory systems? Well, size does matter. As you build larger and larger memories, they just take longer to access: there's more capacity, you have to drive longer wires and go physically farther, and you can't fundamentally fight the speed of light. It takes a certain amount of time to cover a certain distance, and if that distance is farther, it takes longer. You also have more fan-out in bigger memory systems. So, physical size is important when we're thinking about memory systems.

Okay. So, let's now introduce a cache, or the idea of why a cache is good, by looking at the notion of a memory hierarchy. Unlike the earlier picture, where the processor connected directly to all of the memory in the system, we're now going to put in some intermediate levels. We have a processor, and we put small, fast memories next to it, and hopefully we hit in, or access, those small, fast memories instead of accessing the big, slow DRAM. Let's look at capacity. The registers, the general-purpose and special-purpose registers in your processor, have very small capacity. SRAM has more capacity, and DRAM has even more; the drawing isn't really to scale, to accentuate that you can fit a lot more in the same space. Latency follows that same hierarchy.

Another important thing to think about when you're building memory hierarchies is that on-chip memory bandwidth is typically a lot higher than off-chip memory bandwidth. On chip, you can just run wires; the second you have to go off chip, you have to go out through the pins of the chip, and you're going to be limited. What this really boils down to is the pins, or balls, or solder columns. Modern chips don't actually have pins on them any more; instead they have a ball grid array, which is little tiny solder balls, and the pitch of those is still much larger than a wire on chip. In fact, a modern part like an Intel Core i5 uses what's called a land grid array, where there is an array of little metal pads on the bottom of the chip, and the contacts sit in the socket on little spring-loaded pins, sometimes called pogo pins: you put the chip in, the little spring-loaded pin hits the metal pad, and that's the contact. Older chips had the pins on the chip itself, and those went into a big socket; ball grid array sits in the middle. But in all of these cases, the connection technology to get off the chip is much larger than an on-chip wire, so you can fit fewer connections in the same area. That's really the insight behind on-chip bandwidth being a lot larger than off-chip bandwidth, and because of this, you're going to want to put small, fast memory close by on chip where we can store data.

Okay. So, what happens in this memory hierarchy? When we go to access data, if the data is in our fast memory, we hit in that small, low-latency memory and come right back. If the data is not in the fast memory, we have to go all the way out to our DRAM and come back. And memory hierarchies work only if you actually find useful data in that fast level. If you build a small, fast memory close by but never find the information you need in it, why did you build it? You're just doing extra work and burning extra power. This is one of the key insights behind why we have caches, and what a cache is.
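One standard way to quantify this is the average memory access time: the hit time plus the miss rate times the miss penalty. Here is a minimal sketch with made-up latencies, a 2-cycle fast memory in front of a 200-cycle DRAM, showing how quickly the average degrades as the hit rate drops.

    #include <stdio.h>

    /* Average memory access time for one fast level in front of DRAM.
       hit_time and miss_penalty are in cycles; miss_rate is a fraction. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Illustrative numbers only. */
        printf("95%% hit rate: %.0f cycles on average\n", amat(2.0, 0.05, 200.0)); /* 12  */
        printf("50%% hit rate: %.0f cycles on average\n", amat(2.0, 0.50, 200.0)); /* 102 */
        return 0;
    }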
When we go to build a memory hierarchy like this and we introduce something like a cache, finding the data nearby is faster, but we need to find it there with high probability. Otherwise, we should not have built it at all and should have just gone out to main memory every time; we're just wasting everyone's time, area, and cost. So, a memory hierarchy only works if your fast memory is actually useful.

Okay. So, let's look at different memory access patterns and try to come up with a couple of techniques we can exploit to increase the probability that the data we're going to need is in that small, fast memory. We have an interesting graph here, and I want to explain what it is. We'll look at a toy example first, and then we're going to look at a memory reference trace from a real operating system with a real program running under it. So, let's start off with some examples. We have time on the horizontal axis and address on the vertical axis. Say your machine has 4 GB of RAM; there's effectively a line across this graph for every single address in the machine, and then we have time. As the program executes and accesses different pieces of memory, we put a dot on the graph.

So, let's start off by looking at something like an instruction fetch. Typically, programs execute in order: they execute the first instruction, the next instruction, and the third instruction, with the addresses monotonically increasing, unless you hit a branch. So, let's take a look at some instruction fetches. We start off by executing some code, and then we're going to execute a loop. When we're in the loop, we're executing the same piece of code over and over again. From an address perspective, we start off outside the loop at some address, then some different address, and then we execute these three addresses, and then those three addresses again, and again, and again. We keep fetching those same three addresses over and over from our memory system. At some point, we exit the loop and go execute something else. Okay, there's a pattern there, and we're going to think about whether we can exploit that pattern in a second.
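As a concrete sketch of that shape in C, hypothetical and not from the lecture itself: some straight-line code, a loop whose few instructions get fetched over and over, and then straight-line code afterward.

    /* Straight-line code, then a loop, then more straight-line code.
       While the loop runs, the same handful of instruction addresses
       (the loop body) are fetched again and again. */
    int toy(int n) {
        int total = 0;                /* code before the loop        */
        for (int i = 0; i < n; i++)   /* loop body: these few        */
            total += i;               /* instructions repeat n times */
        return total * 2;             /* code after the loop         */
    }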
Next, let's talk about accessing the stack. Computer programs typically have stacks; this is where you spill and fill registers, and typically where you pass arguments as well. Most modern compilers spill and fill when you make a function call or return from one, and argument passing happens around function calls too: you put the arguments the later function needs on the stack, and then you go execute it. Alternatively, you can pass arguments in registers, but on something like x86, let's say you put arguments onto the stack, and you also put local variables onto the stack, because it gives you a convenient place to put them.

So, let's take a look at what happens when you go to do a subroutine call. Before the call, the stack pointer is at some address. When you do the call, you push your registers onto the stack: you put something here, something here, something here, and the stack pointer logically increases. Then, once you're in the function you called, you want to access the arguments, and those are all right next to each other on the stack; you might also have to access the same argument over and over again. If you don't have enough registers, this is like our first lecture, where we looked at a stack-based architecture and had to push the same value onto the stack multiple times. It's pretty common to reuse the same local variable or the same argument just because you don't have enough registers, so you load that data from your stack over and over again. So, it's pretty common to access locations very close together on the stack, probably within the same cache line, or at least very close by. And when you're done, you return from the subroutine and pop all the data off the stack; since time is increasing this way, you move back through those addresses in reverse order.

Okay, let's look at some data accesses. The one I want to focus on here is vector accesses. So, what's a vector access? Well, it's something like an array. With arrays, a lot of the time you read the first word, or the first byte, then the next, then the next, doing some operation across all of it. Say we're doing a matrix addition: we have one matrix and we're adding it to another, and they're one-dimensional matrices. You're reading different addresses, but they follow a pattern and they're close by. Now, if you're accessing, say, every fourth element in an array, the address distance gets a little larger and your access pattern spreads a little farther apart.

And scalar values are singleton values. A good example: instead of adding two arrays together, we take one array and add some variable x, which holds, let's say, five, to every single element. Your processor would first load the element from the matrix, then load the scalar value x from main memory, do the add, and do the store. Then it loads the next location of the vector, loads x again, does the add, stores it back, and continues on.
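Here is a rough C sketch of those access patterns; the function names are made up for illustration.

    #include <stddef.h>

    /* Vector access: walk the arrays front to back, so the addresses
       touched increase by sizeof(int) every iteration. */
    void vector_add(const int *a, const int *b, int *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Scalar access: logically, x is loaded again on every iteration,
       so the same address is referenced over and over. */
    void scalar_add(const int *a, int *c, size_t n, const int *x) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + *x;   /* load a[i], load *x, add, store c[i] */
    }

    /* Strided access: touching every fourth element spreads the
       accesses farther apart in the address space. */
    int strided_sum(const int *a, size_t n) {
        int total = 0;
        for (size_t i = 0; i < n; i += 4)
            total += a[i];
        return total;
    }

In practice a compiler would usually keep x in a register, but the point here is the reference stream the program logically generates.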
Okay. So, can we try to exploit these different types of locality? This slide is really important for grasping what's going on in caches; it's probably one of the most important slides in today's lecture, to grasp the different patterns we're seeing and how we exploit them. There are two types of patterns I want to talk about, and your standard cache system is going to exploit both of them. But let's first look at them individually.

The first one is temporal locality. Temporal locality is the notion that if you access a piece of data, you're likely to access that same piece of data again in the not-too-distant future, so it might be useful to keep that data nearby. To say it again: if a location is referenced, it is likely to be referenced again within a short period of time. Let's try to identify that in these diagrams. Well, I see it in a few places: we access some value, and we access that value again pretty soon, and again, and again; that's an access stream with high temporal locality. Likewise, the scalar value here: we access the exact same address again, and again, and again, and the time between one access and the next is pretty short.

Now, let's look for other interesting access patterns. Well, on the stack, when we push values, it's very common to access a value at, say, address 1,000, and then go access 1,000 + 4, 1,000 + 8, 1,000 + 12, 1,000 + 16, all of the nearby values. So, what are we going to call that? Is it a common pattern? Yes, and we see it in loops for instructions too: when we access one instruction, it's very common to go access later instructions packed right next to it, because your code usually just falls through. We're going to call this spatial locality. Spatial locality says that if a location is referenced, it is likely that locations near it are going to be referenced in the near future.

Both of these are just heuristics; there's nothing magical about them. They're just how programs happen to behave. But we're going to build systems that leverage these two characteristics, these two types of locality, to improve the performance of our processor, so that the small little RAMs we put in our memory hierarchy actually store useful data.

I also want to point out that you can have spatial and temporal locality mixed together. If we look at something like running a program with lots of loops, we have spatial locality, because we're executing instructions that sit next to each other, but we also have temporal locality, because we're reusing all of that again and again within a short period of time. Even though it's not the exact same piece of data each time, it's nearby, so we see both spatial and temporal locality exhibited there.
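A small example where both show up at once, assuming a matrix stored in row-major order: each row of A is streamed through consecutive addresses (spatial locality), while the vector x and the loop's own instructions are reused for every row (temporal locality).

    #include <stddef.h>

    /* y = A * x for an n-by-n matrix A stored in row-major order. */
    void matvec(const double *A, const double *x, double *y, size_t n) {
        for (size_t i = 0; i < n; i++) {
            double acc = 0.0;
            for (size_t j = 0; j < n; j++)
                acc += A[i * n + j] * x[j];  /* A: streamed once (spatial); x: reused every row (temporal) */
            y[i] = acc;
        }
    }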
Okay. So, let's finally look at a real memory reference pattern, a real memory reference trace. This has the same axes as what we saw before: time on the horizontal axis, one dot per access, and memory address on the vertical axis. I don't know what the unit on the vertical axis is; it's probably in the thousands or millions or something like that. This is from a study that some people at IBM did many years ago, but we're going to look at it and see the different patterns show up.

The first thing I want to circle here is spatial locality. We see that along this line, we access the first piece of data at, let's say, this address, and we then go on and access all of the data after it, as if we were reading an array: we just read all the values in order. So we have spatial locality: because we went and accessed element one of the array, there's a high probability that we're going to access other data values nearby in that array, at a nearby time.

Let's take a look at temporal locality. The horizontal lines here are temporal locality. You can see that this program is accessing this value over and over again, or at least values very close by; it's hard to tell at this resolution, but this is temporal locality, accessing the same values, or mostly the same values, over and over and over again. And then, finally, we can see both temporal and spatial locality in these little blocks. We're accessing similar sorts of addresses, which is spatial locality, and we're doing it all within a short period of time, so we're also seeing temporal locality of reference.

Caches can exploit both of these. Caches exploit temporal locality by keeping data that gets used and reused many times nearby. When you see accesses that are close together in time, what happens is that a cache sits somewhere before you get to the main memory system, and when you go to access a value, it memorizes that value and keeps it nearby. So if you go to access it again, and your program exhibits temporal locality, that data value will be close by and you won't have to go out to the big, slow DRAM.

And caches can exploit spatial locality. The way caches go about exploiting spatial locality is that when you access a particular piece of data, your cache pulls in the data around it. We'll talk more next time about how that works, but we're going to segment memory into what we call cache blocks, or cache lines, and those cache blocks or cache lines are typically aligned. When you go to access a piece of data, it might be at the beginning of a block, or in the middle, or at the end, but you pull in all the data around it. The heuristic here is that if you access some piece of data, say only the first byte in a cache line, you're likely to go access other bytes in that block, multiple pieces of data around it. So we're able to exploit spatial locality here.
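To give a feel for how a cache sees addresses, here is a tiny sketch that splits an address into a line number and an offset. The 64-byte line size is a common choice but purely an assumption here; the actual organization is next lecture's topic.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64u   /* assumed cache line size, for illustration only */

    int main(void) {
        uint64_t addr   = 0x1234;              /* arbitrary example address */
        uint64_t line   = addr / LINE_BYTES;   /* which aligned cache line   */
        uint64_t offset = addr % LINE_BYTES;   /* byte position in that line */
        printf("address 0x%llx -> line %llu, offset %llu\n",
               (unsigned long long)addr,
               (unsigned long long)line,
               (unsigned long long)offset);
        /* Neighboring bytes up to the end of the line share the same line
           number, so a sequential walk mostly finds its data already
           fetched along with an earlier access. */
        return 0;
    }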