Okay. So now let's start talking about how to leverage more techniques to get a lot of the performance that we got out of our superscalar, but in a VLIW processor setting.

I want to break this into two different things.

First, dynamic events: things that happen in the processor that you can't really do anything about statically. It's really hard to do anything about them statically; you just don't know what's going to happen. Something like an interrupt or an exception happening: how can a compiler know where the exceptions are going to happen? It's just not really possible.

And then we'll talk about speculation, which an out-of-order superscalar does, and a lot of things fall into this case. We can actually add things to classical VLIWs to get us closer to the performance improvements, or the instruction-level parallelism improvements, that out-of-order machines get from speculation.

We're going to talk about two different types of speculation today. First, how to speculatively move instructions over branches. In particular, this is going to get really interesting with loads moving across branches, because you have a load and you want to move the load as early as possible. You want to pull the load up early in your program so that if it misses in the cache, or it has to go out to main memory, you can hide the latency while you're doing other things.

We're also going to talk about speculating that two memory addresses are different: one of a load, let's say, and one of a store, though you could also think about this as loads with loads and stores with stores. We want to reorder those, for a similar reason: we want to get loads started as early as possible and stores started as late as possible, so we have enough time to do the computation when we do the scheduling. And remember, in VLIWs we're doing all this scheduling statically, in the compiler.

As you may recall from the out-of-order superscalar perspective, when you try to do load-store speculation, and reorder loads and stores dynamically in an out-of-order superscalar, you actually need an extra structure to figure out when you made a mistake, when you reordered something and it was not a valid reordering. And we have to add that same structure, or a very similar structure, into our VLIW processor.
So let's start off by talking about what the compiler has to do, or how the compiler even starts to think about this speculation. I want to introduce a topic here from compilers. This is an introduction to how we schedule for VLIW: the nitty-gritty of what the instruction scheduler, the back end of the compiler, is going to be doing.

So here are two pieces of code: before code motion and after code motion. What is code motion? Well, code motion is the reordering of instructions in order to tolerate the latencies of the instructions, or to do something else useful. It might not just be tolerating instruction latencies; it might also be reordering for alignment issues. Sometimes branches can only happen on certain alignments, or the performance is better that way. But roughly, we're trying to reorder the code; or rather, do valid reorderings of the code. We can't just move an instruction across a read-after-write dependence. If instruction A reads the result of instruction B, we can't move A before B, because it just wouldn't make any sense. But a lot of the time, the compiler has some flexibility in what it can move around.

Okay, so let's look at this example case here. We've got a little bit of code, and there are two things to note. One, this branch at the end is dependent on register 16. Where does register 16 get computed? Does it actually get computed anywhere in here? No. We're also saving away register 16 with a store. So does register 16 get computed here? Maybe not, but we probably want to give it as much time to be computed as possible. The other thing we have here is a load. Okay, well, what if the load misses in the cache? This next instruction here is directly dependent on that load, so we probably want to give the load as much time as possible to execute. So, traditionally in code motion, you want to push the loads up and push the stores down.
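To make "valid reordering" concrete, here is a minimal sketch, in Python, of the legality test a scheduler applies before hoisting one instruction above another. The instruction records and register names are made up for illustration; they are not from the lecture's example.

    # Minimal sketch of the dependence test behind code motion (illustrative).
    # Each instruction lists the registers it reads and writes.

    def can_swap(early, late):
        """True if 'late' may be hoisted above 'early' without breaking
        register dependences (RAW, WAR, WAW)."""
        raw = early["writes"] & late["reads"]    # late reads what early writes
        war = early["reads"] & late["writes"]    # late overwrites what early reads
        waw = early["writes"] & late["writes"]   # both write the same register
        return not (raw or war or waw)

    add  = {"op": "add",  "reads": {"r2", "r3"}, "writes": {"r1"}}
    load = {"op": "load", "reads": {"r4"},       "writes": {"r5"}}
    use  = {"op": "mul",  "reads": {"r5", "r6"}, "writes": {"r7"}}

    print(can_swap(add, load))  # True: independent, so the load can be hoisted
    print(can_swap(load, use))  # False: RAW, the mul needs the load's result

This only covers register dependences; memory dependences between loads and stores are exactly the harder problem the rest of this section is about.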
Now, if we push the loads up as high as possible and the stores down as far as possible, we're going to move a load and a store past each other. And the question comes up: what if these two addresses are the same? Well, let's assume that they're not, and let's assume the compiler can prove they're not. Then you should be able to do code motion here, where we rejumble all the instructions: we push the load up, push the store down, and leave the branch at the end. So we've done some code motion.

The next thing we want to do is bundle, or schedule, the instructions. When we say bundle, we're going to take several operations and pack them together into one VLIW instruction, one VLIW bundle. We denote that here with these braces, which say these operations execute in parallel, at the same time. So first you want to do scheduling, and then you want to somehow bundle nondependent instructions together. There have been a couple of algorithms that tried to do scheduling, code motion, and bundling all together, but those don't always work out as well. So most VLIW compilers now will actually do the code motion and the scheduling first, and then they'll do the bundling at the end.
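As a rough illustration of that last step, here is a hedged sketch of greedy bundling: walk the already-scheduled instruction list and pack operations into a bundle until one depends on something already in the bundle, or the machine's issue width is full. This is my own toy algorithm, not any production VLIW back end, and the issue width of three is assumed.

    # Illustrative greedy bundler (not a production algorithm).
    # Packs already-scheduled instructions into fixed-width bundles,
    # starting a new bundle whenever a dependence or the width forces it.

    WIDTH = 3  # assumed issue width of our hypothetical VLIW

    def depends_on(instr, bundle):
        written = set().union(*(b["writes"] for b in bundle)) if bundle else set()
        return bool(written & (instr["reads"] | instr["writes"]))

    def bundle_schedule(instrs):
        bundles, current = [], []
        for instr in instrs:
            if len(current) == WIDTH or depends_on(instr, current):
                bundles.append(current)
                current = []
            current.append(instr)
        if current:
            bundles.append(current)
        return bundles

    i1 = {"reads": {"r2"}, "writes": {"r1"}}
    i2 = {"reads": {"r4"}, "writes": {"r5"}}
    i3 = {"reads": {"r5"}, "writes": {"r6"}}   # depends on i2
    print(len(bundle_schedule([i1, i2, i3])))  # 2 bundles: {i1, i2}, then {i3}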
Okay. So, let's look at our first technique to get at instruction-level parallelism, and our first type of speculation in VLIW processors. What's our challenge here? What's our problem? Well, our problem is that branches, or the presence of a branch rather, restrict the forms of code motion the compiler can do. So, we have a branch, and we have some code after the branch. If you have, let's say, an add instruction, it's pretty easy to move it before the branch. Okay. How about an instruction which can take a fault or take an interrupt?

Loads are a good example of one of those sorts of instructions. How is that possible? Well, a load can take a page miss or a page fault. It can try to access some value which is not mapped in by the operating system. So, let's say we take this load and we move it above the branch. Okay, that's not too bad. Now, what happens if that load takes a trap or takes an interrupt? That was never supposed to happen. In correct program sequence order, the branch was going to branch around the load, so the load was never going to execute. But all of a sudden, the load got moved up, the load executes, and it takes an interrupt. Well, your program crashes: it takes either a SEGV, or a bus error if you're on something like a SPARC system, for the reason of taking a segmentation fault, even though nothing actually went wrong. This branch was supposed to protect this load from ever executing. R1 here, let's say, just gets loaded with some bogus value if the branch is taken; the compiler knew that, and the programmer knew that. But all of a sudden the compiler decides to reorder things and has moved the load up, because it wanted to try to get more performance.

This is something that your out-of-order superscalars do also: an out-of-order superscalar will pull nondependent loads up in the program and try to move them across branches. But in an out-of-order superscalar, if you pull a load up and it does something wrong, it's just speculative state; you throw it away and don't care about it. So we need some way to have loads that don't hurt the system. It's a similar sort of idea, but as things stand, we can't move the load above the branch, because it might cause some sort of exception.

So the solution to this is that we're going to add special instructions, special versions of the instructions, which don't take faults. If the fault were to happen, the instruction sort of remembers that the fault happened, but it doesn't actually interrupt the process. This allows us to pull some of these instructions above branches, to perform this code motion statically at compile time.

Okay, so let's take a look at this example here. How would we go about doing this? We're going to move the load up; let's say we move the load as high as possible, above these two instructions, instruction one and instruction two, here. And what do we need to do? Well, if you look, we don't have the word "load" here anymore. We have Load.s. It's a speculative load; it has different semantics.
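To make the idea concrete before the Itanium details, here is a hedged Python model of a non-faulting load. The names, the dictionary-as-memory, and the NaT sentinel object are all my own illustration, not Itanium's actual encoding; real hardware tracks this with a poison bit per register.

    # Toy model of a non-faulting ("speculative") load. Instead of trapping
    # on a bad address, it returns a poison value marking the destination
    # register as "not a thing" (NaT). Entirely illustrative.

    NAT = object()            # poison sentinel standing in for the NaT bit
    MEMORY = {0x1000: 42}     # pretend address space

    def load(addr):
        """Ordinary load: faults (raises) on an unmapped address."""
        if addr not in MEMORY:
            raise RuntimeError("SEGV")  # the trap we must not take early
        return MEMORY[addr]

    def load_s(addr):
        """Speculative load: never faults; remembers the fault as NaT."""
        if addr not in MEMORY:
            return NAT
        return MEMORY[addr]

    r1 = load_s(0xDEAD)   # hoisted above the branch: no crash, r1 is poisoned
    print(r1 is NAT)      # True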
The Load.s never causes an exception, even if it takes a page fault, or dereferences a null pointer, or has any bad underlying memory reference, anything like that. It's not going to actually take that fault. Instead, it just sort of remembers it. And this is typically done, if you look at something like Itanium (this is the sort of code sequence out of Itanium processors), as follows. In Itanium, the destination of the load, in this case R1, is going to be set with a poison bit. In Itanium this is actually called a NaT. What it means is that the destination register of the load is marked as "Not a Thing": it's not there, it's not a valid value. And what's important about this is that it's when you go to use that value that you take the exception.

So let's take a look at this example. Down here, where the load originally was, we put in a check-speculation instruction, and it uses R1. What this is going to do is check the value of R1, and if R1 took a trap or an exception, if there was some problem with it, we can jump to fix-up code. So there will be some other piece of code somewhere else we can branch to, and the fix-up code is going to re-execute the load, make sure everything is okay, and then jump back to the use. So it's speculation: we're betting that in the common case, when we pull this load up, nothing bad is going to happen. But we put a check in just to make sure that this is the case.

One other thing I want to point out is that these poison bits, these NaTs, propagate. Why is this important? Well, actually, let me correct what I said a little bit earlier: if you try to use R1 somewhere else before you do a check, the NaT propagates; it doesn't take a trap. It only takes a trap, or jumps to fix-up code, when you actually get to the check. That's either a special check instruction, or there's another kind of load you can put there, a non-speculative load to the same address, that will re-execute the load if the original load didn't work. Okay, so why is this propagation of NaTs important? Well, it allows you to pull the code down here up above this branch as well.
So, we can pull the load up, and we can pull dependent instructions up too, just as long as we do the check at some point: a check to make sure that the load didn't take an exception and there was no problem. We can basically speculate that the load is fine, and we can speculate that the instructions that depend on the load are going to be fine. And even if the instructions that depend on the load are not fine, well, we've speculated on those too. At some point in our code we need to introduce this check instruction, which sees whether the value is a NaT.

So suppose it propagates all the way through your code: you have a load, then an add, then a subtract, all of which somehow read from register one, a whole dependence tree hanging off register one through that code sequence. At some point, after the branch has actually resolved, you want to check whether everything's okay. And because it all propagates, you can basically do the work early and then check later. When you do the check later, either you see: oh, everything's okay, the register has a real value in it; or it has a NaT in it, the poison bit is set and has propagated all the way through. At that point, you can jump to fix-up code, which re-executes the load and all of the code you pulled up, all that speculative code. And that's why you need fix-up code and not just a re-execution of the load: the fix-up code might have to re-execute more than just the load. It might have to execute the load and all the other code you pulled up above the branch.
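Continuing the toy model from before (again with my own illustrative names, not Itanium syntax), propagation and the deferred check might look like this. The value 5 in the fix-up is an assumption standing in for what the real load would have returned.

    # Toy model of NaT propagation and a chk.s-style check (illustrative).

    NAT = object()

    def alu(op, a, b):
        """Any ALU op propagates poison instead of trapping."""
        if a is NAT or b is NAT:
            return NAT
        return op(a, b)

    def chk(value, fixup):
        """At the load's original program point: recover if poison arrived."""
        if value is NAT:
            return fixup()  # re-execute the load AND the hoisted dependents
        return value

    # Speculated early, above the branch:
    r1 = NAT                                # pretend the hoisted load_s faulted
    r2 = alu(lambda a, b: a + b, r1, 10)    # r2 becomes NaT, no trap yet
    r3 = alu(lambda a, b: a - b, r2, 3)     # still NaT, still no trap

    # After the branch resolves, at the original program point:
    r3 = chk(r3, fixup=lambda: (5 + 10) - 3)  # fix-up redoes load and dependents
    print(r3)  # 12

Note that the fix-up recomputes the whole hoisted chain, which is exactly why a bare reload would not be enough.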
Okay. So, let's move on to another kind of speculation, and this is data speculation. In data speculation, our problem is that memory hazards are going to limit our scheduling, especially since we have to do all the scheduling statically. If you're conservative in your compiler, and you can't prove anything about the different addresses, then all of a sudden every single load is dependent on every single other load, and every single store is dependent on every single other store. Now, why do I say loads depend on other loads? Most loads you might be able to reorder pretty easily, but you have to worry a little bit about one taking a trap before the other; I guess we kind of already talked about how to solve that problem. But once you reorder a load and a store, or a store and a store, or anything involving a store, then you start to have real problems. All of a sudden, some state changed in your memory system, and you have to make sure that the later instructions in original program order actually pick up the result of that: later loads and stores have to honor it. So this is really focused on moving loads up past stores.

So what's our solution? Well, our solution is to add a hardware check to guarantee that the pointers, the two memory addresses, when you reorder a load and a store, or a store and a store, are not the same. And this is actually the same thing that we did in our load-store queue, the memory disambiguation that we talked about in out-of-order processors. It's the same sort of thing going on here. But we need to actually change the instruction set to handle this in a VLIW setting, whereas in the superscalar it was all microarchitecture, below the instruction set, that no one ever had to look at.
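Here is a tiny illustration, with a made-up example rather than the lecture's code, of why reordering a load above a store is only safe when the addresses differ:

    # Why load/store reordering needs an aliasing guarantee (illustrative).

    def run(mem, reorder, load_addr, store_addr):
        """Original order: store then load. Reordered: load then store."""
        if reorder:
            r1 = mem.get(load_addr, 0)   # hoisted load
            mem[store_addr] = 99         # the store it was hoisted above
        else:
            mem[store_addr] = 99
            r1 = mem.get(load_addr, 0)
        return r1

    # Different addresses: the reordering is invisible.
    print(run({0x10: 7}, False, 0x10, 0x20),
          run({0x10: 7}, True,  0x10, 0x20))  # 7 7

    # Same address (aliasing): the hoisted load reads a stale value.
    print(run({0x10: 7}, False, 0x10, 0x10),
          run({0x10: 7}, True,  0x10, 0x10))  # 99 7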
Okay, so let's look at how to do this. We have a code sequence here: some instructions, a store, and then a load. The compiler can't prove anything about these stores and this load. But an out-of-order superscalar could guess, or speculate, that the load and the store don't share the same address; and if they did, it could basically roll back the code and re-execute the load. We're going to emulate something like that, but doing it all statically with the compiler.

Okay, so how do we do this? Well, this is something else that Itanium added, and that also appeared in earlier research VLIW, or explicitly parallel instruction, processors. What you're going to do here is move the load up above the store. This is basically speculating that the load and the stores don't hit the same address. And in much the same way that an out-of-order superscalar, when it reorders loads above stores, adds the load address to a load-store queue or a memory disambiguation queue, we're going to do the same sort of thing here. When we move this load up, we're going to use a different instruction, an advanced load, Load.a. What this is going to do is perform the load and add the address to a special table. What does ALAT stand for? Advanced Load Address Table; the ALAT is what Itanium calls this. It's an associative structure where you can put addresses in it, and then check other addresses, from other instructions, against it. And if you get a hit, you know that something bad happened, and you can re-execute.

So we're going to see here that we have the advanced load; we get the load out to the memory system early, and we can overlap its time in the memory system with executing other instructions. Then we have a store here. Now, the store is going to take its own address, and we've changed the instruction set: we've added these two new instructions and changed the semantics of a store. The store is going to check in the ALAT, the Advanced Load Address Table, to see if the address the store is going to is already there. And if it is, it's going to change the table somehow; we'll talk in a second about what exactly happens, but roughly, it's going to take that address out of the table. And then when this load check happens down here, it checks to see if the address is still in the table. If it's been bumped out of the table, it jumps to fix-up code, which is basically going to re-execute the load, or re-execute the load and all the dependent instructions of the load. So we can effectively do data speculation, but all in software. This does require us to have an associative hardware block, some sort of content-addressable memory, or CAM, that all load addresses and all store addresses are passed across. This is similar to the memory disambiguation queue in our out-of-order superscalar.
Okay, so let's look at the ALAT, the Advanced Load Address Table. What do we have in here? Well, three things. We store the address. We store the size; this is important so that we know whether it's a word, double word, half word, or byte when we go to do the matching. And when a Load.a happens and adds an entry into this table, we also record the register number that is the target of the load. That's useful when we go to do the check: it's matched up to make sure it's the same load from before, and we can effectively remove the address from the table when we do the check.

Okay, so we added an address. Now we're running along, and we execute stores, but say the stores don't hit on that address. When a store executes, what does it do? Well, it looks in this table and CAMs, does a comparison, against all the addresses in the table, with size matching and everything. And if its address is in the table, it removes it from the table. So if, by the time the load check happens, that address is not in the table, the load check looks in the table and says: oh, I'm not in the table anymore; this address is not in the table anymore. At that point, you know you have to re-execute that load. But if it's still in the table, that means no store hit it and bumped it out of the table in the intervening time, so the speculative load was fine.

So we've effectively done what out-of-order superscalars can do: we can now move loads above stores. And we did it through a changed instruction set architecture, by adding these special load check and check instructions, by changing the semantics of stores, and by adding an advanced load, the Load.a instruction, here. We can have data speculation in our program.
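Putting the pieces together, here is a hedged software model of the ALAT mechanics just described. It is deliberately simplified: it matches on exact addresses only, whereas a real ALAT is a CAM that does size-aware overlap matching, and Itanium's actual ld.c/chk.a behavior differs in details.

    # Toy software model of an Advanced Load Address Table (illustrative).

    MEMORY = {0x100: 1}

    class ALAT:
        def __init__(self):
            self.entries = {}  # target register -> (address, size)

        def load_a(self, reg, addr, size=4):
            """Advanced load: do the load early and record its address."""
            self.entries[reg] = (addr, size)
            return MEMORY.get(addr, 0)

        def store(self, addr, value, size=4):
            """Store: evict any advanced-load entry it conflicts with."""
            MEMORY[addr] = value
            for reg, (a, s) in list(self.entries.items()):
                if a == addr:            # real hardware checks byte-range overlap
                    del self.entries[reg]

        def load_c(self, reg, addr):
            """Load check: if the entry survived, the speculation was fine;
            otherwise re-execute the load (the fix-up)."""
            if reg in self.entries:
                return None              # still valid, keep the early value
            return MEMORY.get(addr, 0)   # bumped out: redo the load

    alat = ALAT()
    r1 = alat.load_a("r1", 0x100)        # hoisted above the store
    alat.store(0x100, 42)                # aliases, so the entry gets evicted
    redo = alat.load_c("r1", 0x100)
    if redo is not None:
        r1 = redo
    print(r1)  # 42: the check caught the conflict and re-executed the load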
Okay. Now, a couple of other things that are fun to add to classical VLIWs to improve performance; I want to breeze through these next two a little bit faster.

The first one is multiway branches. The problem: you have a very long instruction word architecture, and let's say you have ten operations in one instruction, in one bundle, but you have short branch distances, or maybe multiple branch decisions happening; you want a branch roughly every three instructions. Well, then you have lots of wasted slots, if you can't somehow have one instruction branch in multiple directions. You might say: branching in multiple directions? That's a really weird idea. How do you have an instruction go multiple ways? Branches go one of two places: they either fall through or go someplace else. Well, we can actually change the instruction set to have a branch that can go multiple directions.

So, let's look at this with a full predication example. We have two bundles of code here. This first one is three instructions, all executed in parallel, and it's going to do three different compares. This is like our full predication example: it writes these predicate registers with the logically positive and logically negative versions of each predicate comparison. Okay, so that stages us to do three different branch decisions. And then, down here, we have three branches that all branch based on those different predicates we already set up, and they can branch to three different places: label one, label two, and label three. There's also fall-through code, if none of the branches is taken.

And there's one little wrinkle here where you have to figure out the semantics; I should say Itanium supports this, you can buy real chips with this in them. If you have three branches together, you need some ordering over these branches. Because let's say they all resolve to actually being taken; well, which one do you take? You could branch to three different places, and you want deterministic execution. So typically, when you have an instruction set with multiway branching like this, you define some order: maybe the first one has priority over the second one, which has priority over the third one, and so on.
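As a last sketch, that priority rule can be modeled like this; the code is my own toy illustration, not Itanium semantics verbatim:

    # Toy model of a prioritized multiway branch bundle (illustrative).
    # The first branch in priority order whose predicate is true wins;
    # otherwise execution falls through.

    def multiway_branch(branches, fallthrough):
        """branches: list of (predicate, target) in priority order."""
        for predicate, target in branches:
            if predicate:
                return target
        return fallthrough

    p1, p2, p3 = False, True, True   # say two predicates resolved taken
    print(multiway_branch([(p1, "L1"), (p2, "L2"), (p3, "L3")], "fall"))
    # -> "L2": p2 outranks p3, so execution is deterministic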