Okay. So, an important question that comes up with something like a superscalar, when you're executing multiple instructions at a time, is: what happens when you fetch two instructions and one of them takes an interrupt or exception as it's going down the pipeline?

Let's take an example. Let's say we have a load and then a system-call instruction. Now, both of these instructions can take interrupts or exceptions. The load can take something like a TLB miss or an alignment fault, and the SYSCALL instruction, by definition, makes an interrupt occur. One of the interesting questions is what happens to this load word as it goes down the pipe. We fetch these two at the same time, and they start marching down the pipes. This is our pipeline diagram: we fetch at the same time, we decode at the same time. The load can go to the A-pipe or the B-pipe; here it ends up in B, and the SYSCALL ends up in the A-pipe.

Well, what exactly does it mean if the load is in the B pipeline but takes an interrupt, and it has to commit first, in order? Hm, well, actually, let's think about an even simpler question. What if the load does not take any faults and the SYSCALL takes a fault? Which happened first, A or B?

Okay. So, what should happen first in program order? A load, and then an instruction after the load. The load should happen first, because in program order we go from top to bottom. But the load is in our B-pipe, and in our A-pipe we have an instruction which takes an interrupt.

Well, what happens here is that the load should go down the pipe and complete. So as not to deadlock, your control logic basically has to know about this. Very late in the pipe, you have what we're going to call a commit point, which we'll be talking about later in today's lecture, and there you have to make a rational decision about which of these actually occurred first in program order; you somehow have to track that going down the pipe. Then you have to make a decision: oh, well, the A-pipe just took an interrupt, but in program order the B-pipe holds the first instruction going down the pipe. So, at the end of the pipe, you're going to have to make a logical decision, and you have to have a little bit of control logic to make sure that you don't, say, take the interrupt for the SYSCALL and kill the load instruction that comes before it. One thing you can do is have both of them go down to the end of the pipe, not kill the load but commit it, and have the SYSCALL actually take the interrupt. That's probably the highest-performing thing you can do in this case. Lower-performing approaches would probably be easier, but that's the highest-performing option.
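To make that concrete, here is a minimal sketch of the kind of decision the commit point has to make, assuming a two-way machine where at most one pair of co-issued instructions reaches commit per cycle. The names (Uop, commit_pair, the exception strings) are made up for illustration and are not from any particular design; the point is simply that program order, not pipe assignment, decides whose exception gets taken.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uop:
    name: str                        # e.g. "LW" or "SYSCALL"
    pipe: str                        # "A" or "B": which pipe it was steered to
    exception: Optional[str] = None  # e.g. "TLB_MISS", "SYSCALL_TRAP", or None

def commit_pair(older: Uop, younger: Uop):
    """Decide, at the commit point, what to do with two co-issued uops.
    Returns (committed, killed, taken_exception)."""
    # The older instruction in program order is examined first,
    # regardless of which pipe it happened to travel down.
    if older.exception is not None:
        # Older uop faulted: take its exception, squash the younger one.
        return [], [older, younger], (older.name, older.exception)
    if younger.exception is not None:
        # Older uop is fine: commit it, then take the younger's exception.
        return [older], [younger], (younger.name, younger.exception)
    # Neither faulted: both commit in program order.
    return [older, younger], [], None

# The lecture's example: the load (B-pipe) is older, the SYSCALL (A-pipe) traps.
lw = Uop("LW", pipe="B")
sc = Uop("SYSCALL", pipe="A", exception="SYSCALL_TRAP")
print(commit_pair(lw, sc))   # load commits, then the SYSCALL's trap is taken
```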
Okay. So, we've introduced this two-way superscalar. One thing we need to think about is that we've added a lot more places data could be coming from if we want to forward data. When we had one pipeline, we could bypass out of three places in the diagram. Now that we have two pipelines, you effectively double the number of places you can bypass out of, and you've got six. So, if you pull the steering logic off and make the multiplexers doing your bypassing bigger, you end up with six different locations to choose between for each input operand, and this is a relatively short pipe. As you start to go to bigger and bigger pipelines, either in depth or in width, and you want full bypassing, you're going to have much wider multiplexers and a lot more data being bypassed. So this actually becomes a problem that you need to think about really hard.
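As a rough illustration of why the operand muxes grow, here is a sketch of bypass-source selection for a two-wide pipe. The stage names, the fixed priority order, and the dictionary representation are assumptions made for illustration; real hardware implements this as mux select logic, not a loop, and within a single stage the priority between the A and B slots would depend on program order.

```python
# Pipeline registers that can forward a result, searched youngest-first.
# With one pipe there were three such sources; duplicating the pipe gives six.
BYPASS_SOURCES = [("EX", "A"), ("EX", "B"),
                  ("MEM", "A"), ("MEM", "B"),
                  ("WB", "A"), ("WB", "B")]

def read_operand(src_reg, inflight, regfile):
    """inflight maps (stage, pipe) -> (dest_reg, value) for each slot that
    holds a result; fall back to the register file if nothing matches."""
    for slot in BYPASS_SOURCES:                  # youngest producer wins
        entry = inflight.get(slot)
        if entry is not None and entry[0] == src_reg:
            return entry[1]                      # forwarded value
    return regfile[src_reg]                      # no in-flight producer

# Example: r3 is being produced by an add currently in the A-pipe EX stage.
inflight = {("EX", "A"): ("r3", 42)}
print(read_operand("r3", inflight, {"r3": 0}))   # 42, bypassed
```

Every source operand of every issued instruction needs one such six-way choice, which is why the network blows up as the machine gets wider or deeper.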
So, what are some solutions to this? Well, one solution people sometimes use is not to have full bypassing: you can only bypass at certain locations. That's one option. Another option, which we'll talk about a little later today, is to not have this pipeline register at all, and if you start thinking about out-of-order processors, you can think about committing results back to the register file early. This pipe here has nothing happening in this stage, so couldn't we just shove the result into the register file?

On first appearance that sounds great, but when you start to think about it a little more you can get worried, because write-after-write hazards start showing up as real problems. If you issue an instruction here which writes to the same register as, say, the end of this load operation, you can actually end up writing to the register file out of order. So you need to be cognizant of that.
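Here is a tiny, purely illustrative sketch of that write-after-write problem with early writeback: two instructions that both write r5, where the younger one is allowed to retire its result as soon as it is ready. The instruction names, values, and latencies are invented for the example.

```python
regfile = {"r5": 0}

# Program order: (instruction, dest, value, cycle at which it writes back)
older   = ("LW   r5, 0(r1)", "r5", 111, 4)   # slow load, writes r5
younger = ("ADDI r5, r0, 7", "r5",   7, 2)   # fast ALU op, also writes r5

# If the younger ADDI writes back early, the older load's write lands later
# and clobbers it: writes happen in completion order, not program order.
for instr, dest, value, writeback_cycle in sorted([older, younger],
                                                  key=lambda i: i[3]):
    regfile[dest] = value

print(regfile["r5"])   # 111 -- wrong: in program order, r5 should end up as 7
```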
Another approach people take is what are called clustered superscalars. A clustered superscalar might have, say, four pipelines, clustered into two groups of two. You allow bypassing between the two pipes within a cluster, and if you bypass between the two clusters, it takes an extra cycle, or you have to go through the register file, or something like that. So there are other approaches to try to mitigate the blowup of this bypassing network.

You have to remember that in something like a 64-bit processor, each of these is a 64-bit bus, each one of these little wires here, so these things get pretty big pretty quickly. You have to worry about actually routing these things around the chip, because all of a sudden you have hundreds and hundreds of bits, 600 or so bits of wires, running up and over just for the bypassing of this simple pipeline, and if we go wider, or longer, it's going to be much worse.

So, one thing people do a lot to handle this bypassing, from a critical-path perspective, because it starts to take a long time, is to break the decode and the issue apart. We're going to get away from our five-stage pipes now; up to this point we've been doing things you've seen in the first Patterson and Hennessy book, and now we're going to start thinking about longer pipelines. One good thing to do is to break the decode and the register file access into two separate stages in the pipe, effectively making a six-stage pipeline. And what do we put in each stage? Well, we can break the decode into its own pipe stage and try to figure out structural hazards in that stage. That's where people traditionally do the decode, and they also check whether you're going to have a structural hazard, let's say, on the write port of the register file at the end of the pipe. Then, in the issue stage, I for issue, you do the register file access, and you'll probably swizzle, or cross over, or steer the instructions and the operands to the correct locations. And, of course, you do the bypassing there, if you have lots of bypassed operands coming back.

To give you a brief pipeline example, here we have a pipeline that executes two instructions per cycle, and we can see that it now has an extra I stage in it, which is just an extra front-end stage.

Okay. So, this has some negative aspects. Can anyone think of one negative aspect of putting extra pipeline stages in the front of our pipeline? Yes. So, branches. If the branch gets resolved out of the first execute stage of the pipe, A0 in our pipeline here, we've just increased the branch cost by one. So now, a branch mispredict that would have cost, let's say, two cycles just became three cycles. And this can start hurting your performance, and it really starts to hurt as you go wide.

So, let's take this instruction sequence, where we have this extra issue stage in front and we have a branch as the first instruction. We try to execute, and then we just have fall-through code here, which we predict as fall-through. We don't realize that the branch is taken until A0; at that point, we can redirect and kill everything that we've already put in flight. But look at all the things that have gone in flight already. We're sitting here, which means we've had one, two, three other stages, or three other cycles, to go fetch instructions. So we fetched these instructions, we decoded them, we spent a lot of power, we spent a lot of time, we spent a lot of fetch bandwidth doing this. And then we just kill it all and re-vector to the correct branch target. So, in this example, we've killed seven instructions. This can have a pretty negative impact on our clocks per instruction.
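As a small worked example of that cost, the sketch below just reproduces the counting from the figure: a two-wide machine where the branch resolves in the first execute stage, three cycles after its own fetch group. The function name and the way the count is broken down are my own framing for illustration, not anything from the lecture slides.

```python
def squashed_slots(width, younger_fetch_groups, slots_after_branch_in_own_group):
    """Issue slots killed on a mispredict in a `width`-wide pipeline.

    younger_fetch_groups: fetch groups brought in behind the branch's own
        group before the branch resolves (here decode, issue, execute -> 3).
    slots_after_branch_in_own_group: instructions co-fetched after the
        branch in its own group (here 1)."""
    return width * younger_fetch_groups + slots_after_branch_in_own_group

# Two-wide pipe, branch resolves in A0 after the fetch, decode, issue stages:
print(squashed_slots(width=2, younger_fetch_groups=3,
                     slots_after_branch_in_own_group=1))   # 7 killed
```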
So, let's talk briefly about how to fix this. We're not going to fix all of it today; we have a whole lecture dedicated to it. But what could we possibly do to minimize the probability of all these dead cycles, all these killed instructions going down our two-way pipe? Well, if we're lucky, we can try to predict the branch destination with some accuracy, and have a branch predictor which figures out where the destination of the actual branch is with high probability. Then, instead of executing, say, Op A here, which is a dead instruction down the incorrect branch path, we can try to fetch and execute the correct branch path. We're going to have a whole lecture on how to get your branch prediction accuracy up. In modern-day processors, predictors are somewhere around 98% accurate, give or take a little bit. I actually don't know what the state of the art is, because branch predictors keep getting better and more complex, but there are pretty simple things you can do to get into the mid-80s range, and then getting from the mid-80s of branch prediction accuracy into the 90s takes a lot of effort and time. We'll have a whole lecture later in the course dedicated to branch prediction. But I just want to motivate here that a longer front end of your pipe, and we're going to look at some pipelines with even more front-end stages than just fetch, decode, and issue before the branch gets resolved, is going to impact your performance whenever you add an extra pipe stage at the front. Because even if you have high prediction accuracy, it's not going to be 100%, and when you mispredict, you're going to have dead instructions going down the pipe, wasting time, energy, and utilization of the pipeline.
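To put rough numbers on that intuition, here is a simple back-of-the-envelope CPI model in the spirit of the argument above. The branch fraction, penalty, and accuracy figures are made-up example values, not measurements of any real machine.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, mispredict_penalty):
    # Each mispredicted branch adds `mispredict_penalty` dead cycles.
    return base_cpi + branch_fraction * (1.0 - accuracy) * mispredict_penalty

# Say 20% of instructions are branches, and a mispredict costs 3 cycles
# (the 2-cycle penalty grew to 3 when we added the extra issue stage):
for accuracy in (0.85, 0.98):
    print(accuracy, round(effective_cpi(1.0, 0.20, accuracy, 3), 3))
# ~1.09 CPI at 85% accuracy vs. ~1.012 at 98%, and the gap widens as the
# front end (and therefore the penalty) gets deeper.
```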