Let's move on to memory disambiguation. This is kind of the analog of read-after-write hazards through the memory system.

Okay. We have a basic instruction sequence here: a store, followed by a load. When can we execute the load? Well, that's a tough one. If the load's not dependent on the store, we can just go execute out of order. We can potentially even execute the load before the store in an out-of-order machine. If the load is dependent on the store, so if, let's say, register two is equal to register four, then we're going to have a problem.

So, how do we go about trying to solve this? Up until this point in our pipes we've said, okay, loads and stores are in order, because that solves these problems. So let's say we just execute all loads and stores in order, and we can still allow other instructions to move around them. It's out of order relative to other instructions, but we have in-order memory instructions. And that's going to guarantee that we're not going to have any problems.

Well, we're leaving performance on the table here. We could go faster. We could actually execute loads and stores out of order if R2 and R4 are different. But we're probably going to need some extra structure to do that.

Okay. So, let's look at a conservative out-of-order load example. In our conservative version, let's split the store into two sub-parts: an address computation, and the actual data write. Because we do the address computation early, we can know that the subsequent load is not going to alias, that is, we know R4 is not going to be equal to R2, and we can go execute the load before the store in this case. That's what splitting the store into two different parts, the actual store portion and the address computation, buys us. Unfortunately, we still need to check every single load against all previous uncommitted stores, or we need some structure to go do that. And this is still a bit of a problem, because we're not able to execute a load if we don't yet know the addresses of the stores before it; we just don't know whether we can safely go execute it or not.

Okay, well, can we do one better?
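Before we move on to speculating, here's a minimal sketch of that conservative check, just to pin down the rule. The names here (PendingStore, load_may_issue) are made up for illustration; the point is simply that a load can only go early when every older, uncommitted store has a known address that doesn't match.

```python
# A minimal sketch (not any real machine's logic) of the conservative rule:
# a load may issue out of order only if every older, uncommitted store
# already has a known address and none of those addresses match the load.

from typing import Optional, List

class PendingStore:
    def __init__(self, addr: Optional[int]):
        self.addr = addr          # None means the address hasn't been computed yet

def load_may_issue(load_addr: int, older_stores: List[PendingStore]) -> bool:
    for st in older_stores:
        if st.addr is None:       # unknown store address: must conservatively wait
            return False
        if st.addr == load_addr:  # known alias: the load must wait for this store
            return False
    return True                   # no older store can alias; safe to go early

# Example: an older store whose address is still being computed blocks the load.
print(load_may_issue(0x1000, [PendingStore(None)]))     # False
print(load_may_issue(0x1000, [PendingStore(0x2000)]))   # True
print(load_may_issue(0x1000, [PendingStore(0x1000)]))   # False
```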
Let's start speculating. Let's guess that these are not equal and execute the load before the store. Then we hold off committing the load and store instructions, and we do that in order. And if something bad happens, if the registers do equal each other, we have to replay all of these instructions. What's annoying is we also have to kill all of the subsequent instructions. Anything after that load and store which could have picked up a value from the load, any instructions dependent on the load, we're going to have to go kill all of that. And depending on how precise we want to be, we can either kill everything after it, or we can try to selectively kill just the things which are dependent on it, which is really hard to track. So we have a pretty high penalty here for inaccurate address speculation. If we just put these two instructions into the pipe and they turn out to be dependent on each other, we have to roll back a lot of state.

So, one heuristic for this, which was done in the Alpha 21264, is called memory dependence prediction. What you do is you still guess that the loads and the stores do not alias. But later, if you find that the two registers do equal each other, you squash all the instructions, and then you do something special to that load. You basically mark it as "wait for the store"; you sort of put a barrier in, so you don't get multiple rollbacks over and over again. This is a conservative response to seeing that you're rolling back a lot and throwing away a lot of work. Instead, you can just say, well, I think this load is dependent on that store. You can potentially even remember that across executions; if this load is, let's say, always dependent on the store, you could keep that information in your instruction cache or something like that. Then, when you go to execute that same load at a different time, it's going to wait for all the stores to complete; it's going to barrier. This is just a prediction, a heuristic, but it's one way you could potentially not cause that load to always replay.
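Here's a rough sketch of that wait-for-the-store heuristic, in the spirit of the 21264's memory dependence prediction; the real hardware details differ, and the table keyed by load PC is purely illustrative.

```python
# A rough sketch of the "wait for stores" heuristic described above. The table
# below is keyed by load PC and is invented for illustration, not taken from
# the actual Alpha 21264 implementation.

store_wait_bit = {}   # load PC -> True if this load previously misspeculated

def on_misspeculation(load_pc: int) -> None:
    # After squashing the pipe, remember that this load aliased an older store,
    # so we don't keep paying the rollback penalty on future executions.
    store_wait_bit[load_pc] = True

def load_may_issue_early(load_pc: int, older_stores_outstanding: bool) -> bool:
    if store_wait_bit.get(load_pc, False):
        # Trained: treat the load as dependent and barrier behind older stores.
        return not older_stores_outstanding
    # Untrained: speculate that the load does not alias and issue it early.
    return True

# First time through, the load speculates, aliases, and gets squashed...
print(load_may_issue_early(0x40, older_stores_outstanding=True))   # True (speculates)
on_misspeculation(0x40)
# ...next time around it waits for the older stores instead of replaying again.
print(load_may_issue_early(0x40, older_stores_outstanding=True))   # False (waits)
```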
And if you have multiple loads dependent on one store, you could potentially keep those loads from having to replay multiple times. But the big advantage here is that when you go to re-execute that load some time in the future, you're not going to flush the entire pipe. So this is kind of like branch prediction, if you will: I've trained the predictor to say that this load usually is dependent on the previous store, so just wait for all the older stores to clear out of the pipe. It's a cute little trick.

Okay. So, we're almost out of time. All right.

So, speculative loads and stores. We're going to introduce a store buffer to hold the speculative state. Let's take a look very briefly here, I'll flash up what a store buffer looks like, and then we'll think about it for a second.

Here you have your cache, your L1 data cache, we'll say. We're going to send every address that goes to the L1 cache also to this other structure here, which is the speculative store buffer. Inside the speculative store buffer there are going to be bits basically saying whether each entry is valid or not. A load is going to check against this buffer, and if the data hasn't actually gone into the cache yet, it can go pick up the new value here. The S bit is basically going to tell us whether the store is still in flight. It's possible that we've allocated an entry here, but the store isn't at the end of the pipe yet, or we're still calculating the data. So, if a load comes along out of order, we need to know whether it's going to hit here, and we need to go check for that. That's really why we have two bits here.

Then, sometime in the future, we have to eventually move the data into the data cache, because this structure will get full; if we fire enough stores at it, it's going to fill up. So it's probably a good idea to go move the data into the data cache at some point. We can do that at our convenience, though.

And if the store aborts, this is important here, so if the store, let's say, was itself speculative and on the wrong side of a branch or something like that, we just remove it from this table.

One interesting question here is: if the data is in both this buffer and in the cache, which one takes precedence?
So, the store buffer has the newer values; we have to look there. And then, finally, it's possible to get multiple stores in here which have the same tag, the same address. That happens if you're storing to the same address over and over again. And when you go to read, you need to read the youngest value out of there, because that's the most up-to-date one. So this lookup is a little complicated; you actually need to do it in program order.

Okay. So, we'll stop here for today.
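To recap the store buffer behavior we just walked through, here's a small illustrative model: the buffer takes precedence over the cache, and a load forwards from the youngest matching entry. The class and method names are invented for this sketch, and real hardware does the lookup associatively rather than with a loop.

```python
# An illustrative model of the speculative store buffer discussed above. Each
# entry carries a valid (V) bit, a speculative (S) bit, the store address, and
# its data. Entries are kept in program order, oldest first.

class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []   # list of entries, oldest store first

    def allocate(self, addr, data, speculative=True):
        self.entries.append({"V": True, "S": speculative, "addr": addr, "data": data})

    def load(self, addr, cache):
        # The buffer holds the newest values, so it takes precedence over the
        # cache; with multiple matches we want the youngest, so scan in reverse
        # program order.
        for entry in reversed(self.entries):
            if entry["V"] and entry["addr"] == addr:
                return entry["data"]
        return cache.get(addr)          # no match: fall back to the data cache

    def commit_oldest(self, cache):
        # Drain the oldest store into the data cache, at our convenience,
        # before the buffer fills up.
        if self.entries:
            entry = self.entries.pop(0)
            cache[entry["addr"]] = entry["data"]

    def abort_youngest(self):
        # A store on the wrong side of a branch is simply dropped from the table.
        if self.entries:
            self.entries.pop()

# Two stores to the same address: a load sees the youngest one, not the stale
# value sitting in the cache.
cache = {0x100: 1}
ssb = SpeculativeStoreBuffer()
ssb.allocate(0x100, 2)
ssb.allocate(0x100, 3)
print(ssb.load(0x100, cache))   # 3, forwarded from the buffer
```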