Okay. So, we talked about vector processors, and we introduced this short-vector idea with single instruction, multiple data instructions. I wanted to talk a little bit about one of the common places you see something that's like a vector processor but isn't quite a vector processor, and that's in graphics processing units. This is the graphics card in your computer. My laptop here has an ATI graphics chip in it, and I've actually run OpenCL code compiled to it, which is a general-purpose usage of a graphics processing unit. One of the things I wanted to kick this off with is that these architectures look strange from a computer architecture perspective, and it really goes back to the fact that they were not designed to be general-purpose computing machines; they were designed to render 3D graphics. The early versions of these had very fixed pipelines and no programmability, so you couldn't even use them to do general-purpose computation. Then they started to get a little more flexible. An example of this: the original Nvidia chips had something called pixel shaders. The idea is that, on a per-pixel basis, as you render a three-dimensional picture for a game or something like that, the game programmer gets to add a little customization to how each pixel is rendered. So each pixel, as it got rendered to the screen, ran a little program. What's interesting is that a programming language actually developed around these pixel shaders, and the first implementations of general-purpose computing on these were people writing pixel shaders that nominally rendered pixels but actually computed something else. So you could try to do a matrix multiplication, and the result would be a picture. Kind of a strange idea: you take one picture and another picture, run some special shading on them that makes it look sort of like a 3D picture, and all of a sudden the output is actually the result of your computation. Well, the people who made graphics processing units got the bright idea that maybe they could make this a little easier and grow their user base beyond graphics cards by encouraging people to write programs on these. So they started to make them more and more general purpose. Instead of just being able to run a pixel shader on one pixel at a time, with everything very pixel oriented, or graphics oriented, they renamed everything, came up with a programming model, and exposed some of the architecture to make it a little more general purpose, so you might be able to run other programs on it. That's really what happened. And as I said, the first people who tried to program these through the pixel shaders, and through the per-vertex computations that were also baked into these original architectures, had a very hard time, because you're basically trying to think about everything as a picture in a frame and then back-compute what was going on. But as we've moved along a little bit, we've started to see some new languages.
So, there's new programming support, and the architectures have become more general purpose. This brings us to general-purpose graphics processing units. We still have GPU in there, so it's still special purpose, but we stick GP on the front and call these GPGPUs, General-Purpose Graphics Processing Units. A good example of this is that Nvidia came up with a programming language called CUDA. Now, this was not the first foray into this; there were some research languages and some ideas that predated it. The GPGPU programming model is a little bit strange: it's a threading model, and we're going to talk about that in a little more detail. But first, I wanted to point out some differences between a GPGPU and a true vector processor. In a GPGPU, there's a host CPU, which is the x86 processor in your computer, and then you go across the bus (the PCIe bus, or PCI bus, or AGP bus) out onto a graphics card. So there is a host processor, but there's no control processor like in the vector processors we talked about last lecture. The scalar processor is really far away and doesn't drive everything as strictly; you basically have two processors running, the host processor and the graphics processing unit. And because of this, you actually have to run all of your control code on the vector unit, somehow. This attached-processor model does have some advantages: you can run the data-parallel aspects of the program on the graphics card while the host x86 processor, connected to it across the bus, runs some other program. So, let's dive into detail here and talk about the Compute Unified Device Architecture, which is what CUDA stands for. CUDA is the Nvidia programming model for their graphics processors. There's also a broader, industry-accepted way to program these, with a roughly similar programming model, called OpenCL; the name there is designed to evoke OpenGL, the graphics language widely used for 3D rendering. You have to suspend disbelief here for a second, because this model is a little bit odd, I think. But let's talk about what CUDA is. Let's say we have a loop here, and let's look at a non-CUDA program first; it's the upper portion of this code. We're computing y of i plus a times x of i, where a is some scalar value. This is the traditional DAXPY: a scalar times one vector, plus a second vector. This actually shows up as an inner loop of LINPACK, which is a benchmark that a lot of people run. So it's pretty close to what we were doing before: you're adding one vector to another vector, except that you're taking one of the vectors and multiplying it by a scalar. Pretty simple. In CUDA, the basic idea is that you don't do a lot of work per what we're going to call a thread. So what we're going to do here is define a block of threads, and we use some special annotations to say what's going on. This block has some size. Then we define a thread that can operate in parallel across these 256 elements: there are 256 data values and 256 threads.
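The slide code isn't reproduced in this transcript, so here is a minimal sketch of the pair of versions being described: the plain C loop on top, and the CUDA kernel below it. The function and variable names are illustrative, not necessarily the ones on the slide.

```cuda
// Plain C version: the "upper portion" loop, a scalar a times x, plus y.
void daxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// CUDA version: one micro-thread per element. The loop is gone; each
// thread computes its own index i from built-in keywords.
__global__ void daxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element am I?
    if (i < n)                    // the moral equivalent of strip mining
        y[i] = a * x[i] + y[i];
}
```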
So, what's going to happen here is that we're basically going to do the same operation, a times x of i plus y of i, and store it back. But look at where i comes from: i comes from some special keywords that compute which thread number you are, which index you effectively are within this thread block. And then this if statement here is the moral equivalent of our strip-mining loop. It says: if we get above however much work we're trying to do, don't do any more work. So we can block the work into some number of threads, and then we can pass in an n, where n is the amount of work to actually do, the length of the vector, and check against it along the way. What I find interesting is that, at first glance, it looks like each of these threads is completely independent. And that is the programming model: each of the threads is independent. Unfortunately, on the computer architecture this is going to run on, the threads are not independent. So in CUDA, you don't want to have the threads diverge. They're allowed to diverge; for instance, if there's an if statement in here, one thread can go into the if while a different thread falls through past it. That is allowed. But if you do that, you're basically just not using one of the pipelines for a while. I'll come back to this. Let's talk about how this programming model shows up. The programming model is lots of different little threads; the idea is you make these micro-threads, and then the runtime system plus the compiler, the CUDA compiler, will put all the threads together and have them operate on the exact same instruction at the same time. So they're hiding a single instruction, multiple data architecture under the hood of these threads. If we look at our example here, we do a load, a different load, a multiply, an add, and a store; that's our a times x plus y operation. And across the other way, all of these threads are doing the same operation, or effectively the same operation. They call this single instruction, multiple thread, which is kind of a funny idea. In reality, it's actually single instruction, multiple data, but they introduced this notion of threads, with predication inside the single instruction, multiple data machinery, to allow one micro-thread to do something slightly different via predication. So, what are the implications of single instruction, multiple thread? Well, strangely enough, because it's hard to control the ordering of the threads relative to each other and to the data, the memory system has to support all different alignments and full scatter-gather operations. They don't try to control the addressing, because each of the threads could potentially do some scatter operation. So instead, they have some really smart logic that takes all the addresses coming out of the execution units and says: oh, these look like they line up, we'll issue them at the same time. So if you happen to have threads that all access, let's say, a of i, where i is the thread number, those accesses line up nicely.
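Here is a hypothetical kernel, not from the lecture, that shows both of these things in one place: a data-dependent branch that makes threads within a warp diverge, and the unit-stride addressing pattern that the memory system can pack together.

```cuda
// Hypothetical example (illustrative names, not from the lecture).
__global__ void divergent_scale(int n, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Divergence: within one warp, even and odd threads take different
        // paths, so the SIMD hardware runs each path in turn with the other
        // lanes predicated off, i.e. idle.
        if (i % 2 == 0)
            y[i] = y[i] * 2.0f;   // half the lanes are active here...
        else
            y[i] = y[i] + 1.0f;   // ...and the other half here
    }
    // Addressing: a unit-stride access like y[i] is the case the hardware
    // can pack into one wide memory operation; an access like y[17 * i]
    // would instead fall back to a scatter/gather.
}
```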
The hardware tries to pack these matching accesses together; it actually goes and figures this out on the fly. And, as I mentioned before, you have to use predication here if you have different control flow: you need heavy use of predication to allow threads to go in different directions. Things get even more complicated in these general-purpose GPU architectures; take the word "warp" here and mentally replace it with "thread". If you go read your textbook, there's a nice table that translates GPGPU nomenclature into the nomenclature of the rest of computer architecture. Unfortunately, the GPU people came up with completely different names for everything, which is just kind of annoying, because names already existed for everything. If you look inside one of these GPUs, they actually are a massively parallel architecture with multiple lanes, like our vector architecture, and on top of that, they are a multi-threaded architecture. Typically, these architectures don't have caches, so to hide a lot of the memory latency, what they'll do (and this is part of the reason they have this threading model) is take all the threads that are active in the machine and schedule one; if that thread, let's say, misses out to memory, they'll time-slice it out and schedule a different thread. So they fine-grain interleave threads on a functional unit. It's a strange idea, mixing multithreading with SIMD at the same time: lots of different kinds of parallelism coming together in these GPGPUs. I don't want to go into too much detail, because then you'd have a whole class on how to program GPGPUs. But the basic idea I wanted to get across is that they are a multi-threaded, single instruction, multiple data machine, with this strange notion of threads overlaid on top. And when the threads don't all do exactly the same work, because it's a SIMD machine, you basically just end up wasting slots. These machines have a lot of performance. As an example, the modern-day Nvidia GPU architecture you can go buy is called Fermi; this is what's in that Tesla card I showed last time. Let's zoom in on one of these and talk about what's inside. First of all, look at the stuff that's not programmable, which is actually a significant portion of the design. Down here, they have vertex shaders and tessellation units and texture mapping units and texture caches, and really, what all of this is for is graphics processing. And then, smushed onto that, there's an array of general-purpose units, or mildly general-purpose units. Inside each one of these cores, there's a floating-point unit and an integer unit. If you cut this way, each one of these things they call a core is effectively a lane, so they're replicated in that direction; and then one, two, three, four, five, six, seven, eight, this direction is SIMD. So they basically have a SIMD architecture with multiple lanes: lots of parallelism going on. And at the top here, they have what they call the warp scheduler, which is the thread scheduler that assigns instructions down into the different parallel units.
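Since the lecture doesn't show the host side of this at all, here is a sketch of what it might look like, using the standard CUDA runtime calls; the function and variable names are mine, and it assumes the daxpy kernel from the earlier sketch. The point of the launch line is that asking for far more thread blocks than the chip has units is what gives the warp scheduler enough warps to hide memory latency.

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side driver (illustrative, not from the lecture).
void run_daxpy(int n, float a, const float *x, float *y) {
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;  // the block size from the earlier example
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // strip-mine count
    daxpy<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```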
So, lots of interesting things going on in parallel on these machines. I wanted to stop here on GPUs, because we could have a whole other lecture on them, and I don't really want to go into that level of detail. Let's switch topics and start talking about multithreading. Actually, before we move off this, I'm sure people have questions about GPUs. I'll take one or two of them, but I don't want to go into too much detail. I just wanted to introduce the idea that you can use graphics processors this way; they have some similarity to vector processors, but they're kind of the degenerate case of vector processors. Actually, a quick show of hands. Who has an ATI video card in their machine? Okay. Who has Nvidia? Okay. Who has Intel? Aha. So, an interesting tidbit here. Everyone always thinks of ATI and Nvidia as the leaders in these fancy graphics cards, but in reality, Intel sells the largest number of graphics processors in the world today. That's partially because they kind of give them away for free; they've effectively integrated them onto all the Intel chips now. So it's kind of a funny thing that the least innovative, least exciting graphics processors out there are everywhere not because they're good, but because they're cheap. Lots of things in the world work like that.