Okay. So that gets us out of the outcome part of today's lecture. Let's start talking about how to figure out where we're going. This brings us back to this picture, and I've taken some stuff off this picture now. What we need to do now is look at the targets, the role of targets. And I said we don't know the target for a branch, a jump, or a jump-and-link until we actually do the PC-plus-offset calculation somewhere in the decode stage, and until we at least have the instruction. For jump register and jump-and-link register, we don't know the target until even later. But this is just peeking at when we know the value, not at what the outcome is. And it's important because, even though we have the prediction of whether the branch is taken or not taken, and we can do that, let's say, with 99 percent accuracy, this is the harder problem: we're trying to pick from two to the 32 possible numbers. Can we actually do a good job here?

So something not on the slides, but something I wanted to talk about briefly here, is that we can use a static approach, similar to what we've seen with the other predictors, and do this calculation through the compiler, if you will. There have been a few architectures that do this. One example is the Hitachi, now Renesas, SH-5 architecture, which is sort of a media processor, we'll say. They actually have an instruction which loads the target of a future branch. So they decouple the branch into two instructions: the actual branch, and a prepare-branch instruction that executes earlier in the instruction sequence and says, "I'm planning to do a branch, and its address is such-and-such." It's actually called prepare target, or PTA, in their architecture. Because of this, they don't have this problem; all they need to do is actually execute the branch. They know the target of the branch in the fetch stage and can just go, because this instruction says: load up this special register with the target of my branch, and the actual taking of the branch is decoupled to sometime in the future. So this is an interesting idea. I'm pretty sure one of the CDC machines did this a long time ago as well, but I couldn't find the reference in time for today's lecture.
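To make that decoupling concrete, here is a minimal C++ sketch of the idea — this is just a model of a fetch stage with a prepared-target register, not real SH-5 syntax or semantics, and every name in it is made up:

```cpp
// A hypothetical model of decoupled branching, loosely inspired by the
// SH-5 "prepare target" idea -- not real SH-5 syntax or semantics.
#include <cstdint>

struct FetchState {
    uint32_t pc = 0;
    uint32_t prepared_target = 0;  // loaded early by the prepare-target instruction
    bool     target_valid = false;
};

// Executed well before the branch: "I'm planning to branch, and here is
// its address," so fetch already knows where it may need to redirect.
void prepare_target(FetchState& f, uint32_t target) {
    f.prepared_target = target;
    f.target_valid = true;
}

// The branch itself: fetch can redirect immediately from the prepared
// register instead of waiting for a decode-stage PC+offset calculation.
uint32_t next_pc(const FetchState& f, bool taken) {
    return (taken && f.target_valid) ? f.prepared_target : f.pc + 4;
}
```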
But I know the SH-5 definitely has this instruction. So all I want to get at here is that there are static, software ways to go about doing this. In addition, now let's talk about something dynamic.

Okay. So the most common way to do this is to implement something called a branch target buffer. With a branch target buffer, you actually do a lookup in parallel with both the branch outcome prediction and the PC-plus-four calculation. You put the current PC into it, and it's going to tell you, somehow, one out of two to the 32 possible targets. How does it do this? Well, it just keeps a table of where that branch has gone in the past. So the first time the branch executes, it gets it wrong. But that first execution loads the table with a predicted target. So we load the predicted target based on where this branch went in the past, we index into this table based on our current program counter, and then we check that we actually match. We take the whole PC and do a tag check, very similar to a cache, because this is effectively a cache. And if we hit in this small cache of targets, we have the new program counter, provided we predicted taken. If we predicted not taken, we just want to use PC plus four. So in effect, we can predict the target of the branch based on where this branch has gone in the past. Wait, that's a lot easier than expected — I thought this was going to be really hard. Well, it's actually not that bad.

One thing I should say is that the size of this table matters. If you alias too many places into it, you may end up pulling out wrong data, or you might end up missing quite often. So you want to size this table appropriately. But you don't want to make it too big, because you're carrying the entire program counter in here, and that can get big. So the data in each entry is quite large.
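Here's a minimal sketch of a direct-mapped branch target buffer lookup. The 1K-entry size and 32-bit PCs are assumptions for illustration only; real designs vary.

```cpp
// Minimal sketch of a direct-mapped branch target buffer.
#include <cstdint>

constexpr uint32_t kEntries = 1024;  // assumed table size

struct BTBEntry {
    bool     valid  = false;
    uint32_t tag    = 0;  // the full PC, stored for the tag check
    uint32_t target = 0;  // where this branch went last time
};

BTBEntry btb[kEntries];

// Fetch-stage lookup, done in parallel with computing PC + 4.
uint32_t predict_next_pc(uint32_t pc, bool predict_taken) {
    const BTBEntry& e = btb[(pc >> 2) % kEntries];  // index by low PC bits
    if (predict_taken && e.valid && e.tag == pc)    // cache-style tag check
        return e.target;                            // hit: reuse the old target
    return pc + 4;                                  // otherwise fall through
}

// Called when a taken branch or jump resolves, to train the table.
void update_btb(uint32_t pc, uint32_t actual_target) {
    BTBEntry& e = btb[(pc >> 2) % kEntries];
    e.valid = true;
    e.tag = pc;
    e.target = actual_target;
}
```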
So one way to solve this is that people typically make these tables set-associative. You'll have multiple PC tags and predicted targets per set, because that actually gives a little better performance: you can check multiple entries in parallel rather than just making the table larger. I mean, you could just make it larger, but typically having a little bit of associativity helps. And again, you're really executing this lookup in parallel with the PC-plus-four logic. One other thing that makes this a little better is that the branch target buffer does not hold every instruction. If it's not a branch or a jump, you don't put it in this table. So you really only put instructions that are branches into this table, which helps save a lot of space.
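A hedged sketch of the set-associative variant — here two ways per set, with sizes again assumed. Hardware compares both tags at once; the loop below just models that parallel comparison.

```cpp
// Sketch of a 2-way set-associative BTB lookup: each set holds two
// (tag, target) pairs that hardware would compare simultaneously.
#include <cstdint>

constexpr uint32_t kSets = 512;  // assumed
constexpr int      kWays = 2;

struct Way {
    bool     valid  = false;
    uint32_t tag    = 0;
    uint32_t target = 0;
};

Way btb2[kSets][kWays];

uint32_t predict_next_pc_2way(uint32_t pc, bool predict_taken) {
    const uint32_t set = (pc >> 2) % kSets;
    if (predict_taken)
        for (int w = 0; w < kWays; ++w)
            if (btb2[set][w].valid && btb2[set][w].tag == pc)
                return btb2[set][w].target;  // hit in one of the ways
    return pc + 4;
}
```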
Okay, so we have a question here: how do we achieve this effect without decoding the instruction? We have a branch target buffer, and we don't want to decode the instruction. Can we just look in the branch target buffer — is that good enough? Is all the information we need in that table? Because we're executing it in parallel with the PC-plus-four; we haven't even fetched the current instruction yet. Is that good enough, or are we missing something? Yeah, this is kind of what I was trying to get at: because we check the program counter here with a valid bit and a full tag match, this is good enough. We don't actually need to look at anything else.

This question was also here to make a point about something called a line predictor. Sometimes you want to get rid of this valid bit and PC tag. Then, coming out of here, instead of predicting the program counter — you don't really need to know the program counter — you just need to know where to get the instruction from your cache: sort of, which entry in the cache to go fetch. And this is called a line predictor. So a full branch target buffer will give you an actual PC. At some point you do need to know the PC, but you can wait to know it until later down the pike. All we really need to know is which line in the cache we want to fetch next cycle. And we can do that by just removing this check and, ad hoc, fetching whatever it gets pointed to from here — which might be wrong, so we have to check that later. So anyway, this is a trick question. We can achieve this with a full BTB, a full branch target buffer. If we don't have this check, it turns into something called a line predictor: we're predicting which line in the instruction memory, or which line in the instruction cache, to go fetch from.

Okay. So this brings us to jump register. Hm. This one gets harder to train. We can try to train it the same way; we can try to use the BTB. So let's look at the different cases. What is a jump register used for? It's used for switch statements. It's used for C++ virtual function pointers, or dynamic function calls — so, if you're doing a jump through a function pointer. And it's used for returns from subroutines. So how well does a BTB work in these different cases?

For switch statements: if you use the same switch statement and the same case gets lit up, this works pretty well. It's kind of like training any other predictor. If the case you're choosing out of the switch statement is highly data-dependent, you're probably not going to predict it very well, and there's probably no great solution to that. Some people do a sort of data prediction and try to look at the register values to do a little better, but it gets very hard.

Virtual function calls: well, this works very well. In C++ there are function pointers all over the place, and this is to implement polymorphism. When you do a method call on an object in C++, you're actually indexing through a jump table inside the data structure. And it just so happens that this is basically always correct, unless you actually reassign that object pointer to something else, or that object changes to a different type. So, an example of that: you have a class Animal, and you have Dog and Cat, which subclass from Animal. And you have a pointer to an Animal, and you have a cat and a dog. Let's say you assign the cat to this object pointer, which is an Animal, and you call some method on it, like run. Well, that's going to train up very well. But all of a sudden, if you change that cat to a dog, the virtual function pointer changes, because what you actually execute when you call run on a cat versus run on a dog is different — they do different things. Cats like to sleep a lot; dogs like to run around and bark. The code in the middle of the run function of the cat and the dog is different, so we no longer jump to the same location, and our BTB would be wrong. But it will retrain pretty fast, because you're probably not changing the object pointer very often. So that's decent.
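Here's that example as actual C++ — a minimal sketch of how the indirect jump target tracks the object's dynamic type; the class and method names are just the ones from the lecture:

```cpp
// A virtual call compiles to an indirect jump through the object's
// vtable, so the jump-register target follows the dynamic type.
#include <iostream>

struct Animal {
    virtual void run() = 0;
    virtual ~Animal() = default;
};

struct Cat : Animal {
    void run() override { std::cout << "cat mostly sleeps\n"; }
};

struct Dog : Animal {
    void run() override { std::cout << "dog runs around and barks\n"; }
};

int main() {
    Cat cat;
    Dog dog;
    Animal* a = &cat;
    a->run();  // indirect jump lands in Cat::run; the BTB trains on it
    a = &dog;  // reassigning the pointer changes the jump target...
    a->run();  // ...so the first call after the switch would mispredict
}
```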
Okay. So, subroutine return calls. Jump registers are used there a lot. You call into a function with a jump-and-link; to come back out, you come out with a jump register. Well, BTBs are okay if the place you're calling from is locked down — if you always call from that one location. But if you're calling a function from different locations quite often, effectively what happens is that one function gets called from lots of different places, and you're not going to be able to predict the return location very accurately. So if one leaf function gets called from lots of different locations, that's not particularly good for a BTB.

One interesting thing about the subroutine return case is that the jump register at the end of our leaf function has one fixed program counter. So when you go to index the BTB, you're using the address of that jump register instruction, and its entry might point back to whatever other function called into it in the past — which has no correlation with whoever calls it in the future. That's why this one is really hard to do.

So can we do better for jump registers? Yes. The idea is a subroutine return call stack, or subroutine return stack: you have a small predictor specifically just for jump-register subroutine returns. What you do is track function calls: you push the return address onto a stack when you call, and then when you do a return, you pop and use that as the prediction.
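A minimal sketch of such a return address stack — the depth of eight entries is an assumption for illustration:

```cpp
// Minimal return address stack sketch. If the call depth exceeds the
// stack depth, entries get overwritten and predictions start going wrong.
#include <cstdint>

constexpr int kDepth = 8;  // assumed depth
uint32_t ras[kDepth];
int top = 0;  // number of pushes minus pops

// On a call (jump-and-link): push the address right after the call site,
// since that's where the matching return will land.
void on_call(uint32_t call_pc) {
    ras[top % kDepth] = call_pc + 4;  // assumes 4-byte instructions
    ++top;
}

// On a return (jump register): pop and use it as the predicted target.
uint32_t on_return() {
    if (top > 0) --top;
    return ras[top % kDepth];
}
```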
So let's say we have these functions here: function A calls function B, function B calls function C, and function C calls function D, as denoted here. We have some stack that we're going to push entries onto when we do jump-and-links — so we know there's something, at the link, to load this predictor with. When the function call gets executed, or the jump-and-link gets executed, we push the address of what comes after, effectively, the function call site. I'll denote it here by the address of the call to FB; that's for when FA calls into FB. In reality, it's the instruction right after that call that we're going to return to. When FB then calls FC, we push the return location after the call to FC. And when FC calls FD, we push that. And these come off in the inverse order, because it's a stack. So when we return, we're going to pop return addresses — pop, pop, pop, like that. We pop off the addresses after the calls to FD, FC, and FB, and we're going to predict the return of each function, or the leaf function, correctly, every single time. Where this gets a little hard is if our call depth — or recursion depth, if you will — is deeper than how many entries we have in our return stack predictor. Then we start predicting wrong. But otherwise we can do pretty well with this.
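Continuing the return-stack sketch from above, here is that exact call chain driven through it; the call-site addresses are made up for illustration:

```cpp
// FA calls FB, FB calls FC, FC calls FD, then everything returns.
#include <cassert>

int main() {
    on_call(0x1000);  // FA calls FB: push 0x1004
    on_call(0x2000);  // FB calls FC: push 0x2004
    on_call(0x3000);  // FC calls FD: push 0x3004
    // Returns pop in the inverse order, each one predicted correctly:
    assert(on_return() == 0x3004);  // FD returns into FC
    assert(on_return() == 0x2004);  // FC returns into FB
    assert(on_return() == 0x1004);  // FB returns into FA
}
```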
In fact, some architectures make this a little easier on the hardware by having a special instruction for the jump-and-link: the call instruction. For instance, x86 has a call, which basically tells the hardware to load the predictor. Otherwise, with a jump-and-link, there are too many other uses — it's not always a function call. It's almost always a function call, but not always. A call instruction on x86, though, is basically always a call, so you know to load this predictor. And then on x86 there's a return instruction, which is always a return from a function. So there, the semantics are tied to the predictor very well. Okay.

So there's actually one other topic I want to talk about today that's not in the notes, and it correlates with line prediction, which we talked about. The one other thing I wanted to talk about is way prediction. We haven't talked about this yet, but say you have, let's say, a two-way set-associative instruction cache. We talked about which line to go look at in the instruction cache, but one of the things that's just as hard is this: if you have a two-way instruction cache and you're trying to fetch the next instruction, which way do you shove into the decode stage of your pipe — way zero or way one? You just don't know. This is one of the reasons why people would build direct-mapped instruction caches: it's very hard to know which way is the correct way. But a predictor, just like our branch predictors, helps with this. We can build a way predictor. We can use the same prediction architectures we've talked about in today's lecture — dynamic, multi-level predictors, sometimes even baked into the branch predictor — and use that to predict whether we choose, let's say, way zero or way one; or, if we have a four-way set-associative cache, way zero, one, two, or three. Anyway, that's what I wanted to talk about today, so let's wrap up here.