Okay. So, we talked about vector processors, and we introduced the short-vector idea with single instruction, multiple data instructions. I wanted to talk a little bit about one of the common places you see something that's like a vector processor, but isn't quite a vector processor, and that's in graphics processing units. This is the graphics card in your computer. My laptop here has an ATI graphics chip in it, and I've actually run OpenCL code compiled for it, which is a general-purpose use of a graphics processing unit.

One of the things I wanted to kick this off with is that these architectures look strange from a computer architecture perspective. That really goes back to the fact that they were not designed to be general-purpose computing machines; they were designed to render 3D graphics. The early versions had very fixed pipelines and no programmability, so you couldn't even use them to do general-purpose computation. Then they started to get a little more flexible. An example of this is that the original Nvidia chips had something called pixel shaders. The idea is that, on a per-pixel basis, as you render a three-dimensional picture for a game or something like that, you could apply a little customization to how each pixel gets rendered. The game programmer, for instance, got to write a little program that ran for each pixel as it was rendered to the screen.

What's interesting is that a programming language actually developed around these pixel shaders, and the first implementation of general-purpose computing on these devices was people writing pixel shaders that nominally rendered pixels but actually computed something else. You could try to do a matrix multiplication, and the result would be a picture. [LAUGH] Kind of a strange idea: you take one picture and another picture, run some special shading on them that makes it look sort of like a 3D picture, and all of a sudden the output is actually your result.
Well, the people who made graphics processing units got the bright idea that maybe they could make this a little easier and grow their user base beyond just graphics cards if they encouraged people to write programs on these. So they started to make them more and more general purpose. Instead of only being able to run a pixel shader, working on one pixel at a time and being very pixel-oriented or graphics-oriented, they renamed everything, came up with a programming model, and exposed some of the architecture to make it a little more general purpose, so that you might be able to run other programs on it. That's really what happened. As I said, the first people who tried to use these machines had to program them through the pixel shaders and the per-vertex computations that were baked into the original architectures, which was very hard, because you were basically trying to think about everything as a picture in a frame and then back-compute what was going on. But as we've moved along, we've started to see new languages, so there's new programming support, and the architectures have become more general purpose.

This brings us to general-purpose graphics processing units. We still have "GPU" in there, so it's still special purpose, but we stick "GP" on the front and call these GPGPUs, General-Purpose Graphics Processing Units. A good example of this is that Nvidia came up with a programming language called CUDA. Now, this was not the first foray into this; there were research languages and ideas that predated it. But we'll talk about that in a second. With GPGPUs, the programming model is a little bit strange: it's a threading model, and we're going to talk about that in a little more detail.

First, though, I wanted to point out some differences between a GPGPU and a true vector processor. With a GPGPU, there's a host CPU, which is the x86 processor in your computer. Then you go across the bus, the PCIe bus, or PCI bus, or AGP bus, out onto a graphics card.
So there is a host processor, but there's no control processor like in the vector processors we talked about last lecture. The scalar processor is really far away and doesn't drive everything as strictly; you basically have two processors running, the host processor and the graphics processing unit. And because of this, you actually have to run all of your control code on the vector unit, somehow. This attached-host-processor model does have some advantages: you can run the data-parallel parts of the program on the GPU while the host processor, the x86 processor connected to it across the bus, runs some other program.

So, let's dive into detail here and talk about the Compute Unified Device Architecture, which is what CUDA stands for. CUDA is the Nvidia way of programming, or the Nvidia programming model, for its graphics processors. There's also a broader, industry-accepted way to program these, which uses roughly a similar programming model, called OpenCL; the name there is designed to evoke OpenGL, the graphics language that is widely used for 3D rendering.

You have to suspend disbelief here for a second, because this model is a little bit odd, I think. But let's talk about what CUDA is. Let's start with a non-CUDA program first; it's the upper portion of this code. We're computing y[i] plus x[i] times some scalar value and storing it back. This is the traditional a times one vector plus another vector, and it shows up as the inner loop of LINPACK, which is a benchmark that a lot of people run. So it's pretty similar to what we were doing before: you're adding one vector to another vector, except that you're taking one of the vectors and multiplying it by a scalar. Pretty simple.

In CUDA, the basic idea is that you don't do a lot of work per thread, as we're going to call them. What we're going to do here is define a block of data, and we use some special annotations to say what's going on. This block has some size.
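As a concrete reference point, here is a minimal sketch of that serial, non-CUDA loop; the function name and types are my own choices, not necessarily what was on the slide.

```cuda
// Serial version of the operation described above:
//   y[i] = a * x[i] + y[i]   (the classic a*X + Y, the inner loop of LINPACK)
void saxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```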
Then what we do is define a thread that can operate in parallel across those elements: there are 256 data values and 256 threads. What's going to happen is that each thread does the same operation, y[i] = a * x[i] + y[i], and stores the result back. But let's look at where i comes from. i comes from some special keywords that compute which thread number you are, or effectively which index you are within the thread block. And this if statement is the moral equivalent of our strip-mining loop: it says, if we get above however much work we're trying to do, don't do any more work. So we can block the work into some number of threads, pass in an n, where n is the amount of work to do, the length of the vector, and check against it along the way.

What I find interesting about this is that, on first view, it looks like each of these threads is completely independent. That is the programming model: each of the threads is independent. Unfortunately, on the computer architecture this is going to run on, the threads are not independent. So in CUDA, you don't want the threads to diverge. It's allowed to have them diverge: for instance, if there's an if statement in here, one thread can go into the if and a different thread can fall through. But if you do that, you're basically just not using one of the pipelines for a while. I'll come back to this.

Let's talk about how this programming model shows up. The programming model is lots of different little threads; the idea is that you create these microthreads, and then the runtime system plus the CUDA compiler put them together and have them operate on exactly the same instructions at the same time. So they're hiding a single instruction, multiple data architecture under the hood of these threads. If we look at our example here, each thread does a load, a multiply, another load, an add, and then a store; that's our a times x plus y operation going on here.
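Here is a hedged sketch of what the CUDA version and its launch look like; the kernel name, the 256-thread block size from the lecture, and the launch wrapper are illustrative assumptions rather than the slide's exact code.

```cuda
#include <cuda_runtime.h>

// Each microthread handles one element. The index i comes from the special
// thread/block keywords, and the if test is the moral equivalent of the
// strip-mining check: threads past the end of the vector do nothing.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element am I?
    if (i < n) {
        y[i] = a * x[i] + y[i];  // load, multiply, load, add, store
    }
}

// Host side: launch 256 threads per block, with enough blocks to cover n.
void run_saxpy(int n, float a, float *d_x, float *d_y) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_parallel<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}
```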
And, across the other way, all of these threads are doing the same operation, or effectively the same operation. So they call this single instruction, multiple thread, which is kind of a funny idea. In reality, it's actually single instruction, multiple data, but they introduced this notion of threads, with predication inside the single instruction, multiple data execution, to allow one microthread to do something slightly different via predication.

So, what are the implications of single instruction, multiple thread? Well, strangely enough, because it's hard to control the order of the threads relative to each other and to the data, the memory system has to support all different alignments and scatter-gather operations. They don't actually try to control the addressing, because each of the threads could potentially be doing some scatter operation. Instead, they have some really smart logic that takes all the addresses coming out of the execution units and says, oh, these look like they line up, we'll issue them at the same time. So if you happen to have threads that all access, say, a[i], where i is the thread number, then you have unit-stride operations, and the hardware goes and tries to figure that out and pack them all together. And, as I mentioned before, you have to use predication if you have different control flow; you need strong use of predication to allow threads to go in different directions.

So, things get even more complicated in these general-purpose GPU architectures: they take the word "warp" and use it where the rest of the field would say thread. Unfortunately, the GPU people came up with completely different names for everything, which is kind of annoying, because names already existed for everything; if you go read your textbook, there's a nice table that translates GPGPU nomenclature into the nomenclature the rest of computer architecture uses.

If you go look inside one of these GPUs, they actually are a massively parallel architecture with multiple lanes, like our vector architecture, and then, on top of that, they are a multithreaded architecture.
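To make the divergence and predication point concrete, here is a small sketch of my own (not from the lecture) where threads in the same warp take different branches; the SIMD hardware runs both paths with inactive threads masked off, so some lanes sit idle. Note also that indexing by the thread number is exactly the unit-stride pattern the memory system can pack into wide accesses.

```cuda
__global__ void divergent_example(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads in the same warp take different paths here. Under the
        // hood the SIMD hardware executes both paths, predicating off the
        // threads that did not take each branch, so lanes go unused.
        if (i % 2 == 0) {
            y[i] = 2.0f * x[i];
        } else {
            y[i] = x[i] + 1.0f;
        }
        // The accesses x[i] and y[i] are unit-stride across the warp, so the
        // hardware can coalesce them into wide memory operations.
    }
}
```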
Typically, these architectures don't have caches. So, to hide a lot of the memory latency, and this is part of the reason they have this threading model, they take all the threads that are active in the machine and schedule one of them; if that thread, let's say, misses out to memory, they time-slice it out and schedule a different thread. So it will fine-grain interleave threads on a functional unit. It's a strange idea, mixing multithreading with SIMD at the same time, so there are lots of different kinds of parallelism coming together in these GPGPUs. I don't want to go into too much detail, because then you'd have a whole class on how to program GPGPUs. But the basic idea I wanted to get across is that they are a multithreaded, single instruction, multiple data machine, with this strange notion of threads overlaid on top. And if the threads don't do exactly the same work, because it's a SIMD machine, you basically just end up wasting slots.

These machines do have a lot of performance. As an example, this is the modern-day Nvidia GPU architecture you can go buy; they call it Fermi. This is in that card I showed last time, the Tesla. Let's zoom in on one of these and talk about what's inside. Roughly what's going on is, well, first of all, let's look at the stuff that's not programmable, which is actually a significant portion of the design. If you look down here, they have vertex shaders and tessellation units and texture mapping units and texture caches, and really, what all of that is for is graphics processing. [LAUGH] And then, smushed onto that, is an array of general-purpose units, or mildly general-purpose units. Inside each one of these cores there's a floating-point unit and an integer unit. If you cut this way, each one of the things they call a core is effectively a lane, so they're replicated in that direction; and then one, two, three, four, five, six, seven, eight, this direction is SIMD. So they basically have a SIMD architecture with multiple lanes, and lots of parallelism going on.
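As a rough illustration of how that latency hiding shows up for the programmer, here is a sketch of my own (assumed names, not lecture code): you launch many more threads than there are execution units, so the scheduler always has other warps ready to issue while some are waiting on memory.

```cuda
// A memory-bound streaming kernel written with a grid-stride loop. Launching
// many blocks gives the scheduler plenty of resident warps to interleave on
// each functional unit while other warps wait on memory.
__global__ void scale_kernel(int n, float a, const float *in, float *out) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = a * in[i];
    }
}

void run_scale(int n, float a, const float *d_in, float *d_out) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // oversubscribe
    scale_kernel<<<blocks, threadsPerBlock>>>(n, a, d_in, d_out);
}
```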
And then, at the top here, they have what they call the warp scheduler, which is the thread scheduler; it assigns instructions down into the different parallel units. So there are lots of interesting things going on in parallel on these machines.

I wanted to stop here on GPUs, because we could have a whole other lecture on them, and I don't really want to go into that level of detail. Let's switch topics and start talking about multithreading. Actually, before we move off this, I'm sure people have questions about GPUs. I'll take one or two of them, but I don't want to go into too much detail. I just wanted to introduce the idea that you could use graphics processors; they have some similarity to vector processors, but they're kind of the degenerate case of a vector processor.

Actually, a quick show of hands: who has an ATI video card in their machine? Okay. Who has Nvidia? Okay. Who has Intel? Aha. So, interesting tidbit here. Everyone always thinks of ATI and Nvidia as the leaders in fancy graphics cards, but in reality, Intel sells the largest number of graphics processors in the world today. That's partly because they kind of give them away for free; they've effectively integrated them onto all the Intel chips now. So it's a funny thing that the least innovative, least exciting graphics processors out there are everywhere not because they're good, but because they're cheap. Lots of things in the world work like that.