Okay, let's look at the issue logic here with a pipeline diagram. Here we have OP A, B, C, D, E, F: straight-line code, no branches, flowing down the pipe, and we have our nice pipeline diagram. One of the cool things is that now that we have a two-wide superscalar, we can actually violate a rule we had before, which said two things cannot be in the same pipe stage at the same time, temporally, because time runs from left to right in this diagram. Here we can have two operations, or two instructions, in the fetch stage at once. And because we don't have a great name for these things, we're going to call the execution unit stages A0 and A1 in one pipe, and B0 and B1 in the other. In an ideal world, this is pretty sweet: at least for this code, we actually achieve a clocks-per-instruction of one half. That's pretty awesome. And as I said, we can have two instructions in the same stage of the pipe.

Okay, let's look at a little bit more complex code sequence. We have an add, some loads, another add, a load. Your issue logic, that swapping logic, will actually have to move instructions around in this case. So we have this add and this load.
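As a rough sketch of the arithmetic behind that one-half CPI claim (the two-wide width and the six-instruction OP A through F sequence come from the example; the helper names are my own, not from any real simulator):

```python
from math import ceil

# Sketch: steady-state issue cost of hazard-free straight-line code on an
# n-wide superscalar. Ignores pipeline fill and drain cycles.
def issue_cycles(num_instructions: int, width: int = 2) -> int:
    """Cycles needed to issue all instructions, `width` per cycle."""
    return ceil(num_instructions / width)

def cpi(num_instructions: int, width: int = 2) -> float:
    """Clocks per instruction in the issue stage."""
    return issue_cycles(num_instructions, width) / num_instructions

print(issue_cycles(6))  # OP A..F issue in 3 cycles on the two-wide machine
print(cpi(6))           # 0.5, the one-half CPI from the example
```

With an odd instruction count the last cycle issues only one instruction, which is why real CPI only approaches one half on long runs.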
Well, this one is actually easy: the add goes to the A unit, the load goes to the B unit. No problems there. Okay, now we have the next load. Uh-oh: we fetched it and it's in instruction register zero, which means it wants to go to the A pipe, so we need to swap these two. You can see how we draw this: we say this add is going to the A pipe, which is the opposite of the order they were fetched in. There are still no stalls going on, at least in this example.

And then finally here we actually get a structural hazard, and the structural hazard introduces a stall. We fetch these two loads simultaneously, but we can only execute one load at a time, so we need to stall one of the loads in the decode stage and push it out a cycle. So this actually has a different pipeline diagram than the no-conflict, no-structural-hazard example.

Okay, let's look at a little bit more complex example: a dual-issue data hazard. What happens when you have data hazards? Unfortunately, this first case is without any bypassing. These first two instructions here don't have any data hazards.
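The swap-and-stall decision just described can be sketched as a small routing function. This is a hypothetical model, assuming (as in the lecture's machine) that only pipe B has the memory port while either pipe can take an ALU op:

```python
# Sketch of dual-issue slotting: given the two fetched opcodes, assign
# them to the (A-pipe, B-pipe) slots, swapping when needed, and report a
# structural hazard when both need the memory unit. "bubble" marks an
# empty slot; the second instruction then re-issues next cycle.
MEM_OPS = {"load", "store"}  # assumed: only pipe B can execute these

def issue_pair(i0: str, i1: str):
    """Returns ((a_slot, b_slot), stalled)."""
    if i0 in MEM_OPS and i1 in MEM_OPS:
        return ("bubble", i0), True      # structural hazard: i1 waits in decode
    if i0 in MEM_OPS:
        return (i1, i0), False           # swap: memory op to B, ALU op to A
    return (i0, i1), False               # natural order works

print(issue_pair("add", "load"))   # (('add', 'load'), False)
print(issue_pair("load", "add"))   # (('add', 'load'), False), swapped
print(issue_pair("load", "load"))  # (('bubble', 'load'), True), stall
```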
But here we have a write to register five and a read from register five: a read-after-write hazard. And because we're not bypassing in this pipeline yet, we actually have to stall the second instruction waiting for the first one, even though we could potentially have executed them at the same time; there's a real data hazard there. So we need to introduce stall cycles into the second instruction. Does this make sense to everybody? We're going to push out that add.

If we have full bypassing, we may still potentially need some stalling, but now we don't have to wait for the value to get to the end of the pipe before picking it up in the ALU. We can pull it back, because we can bypass, say, the add result after A0. What you see here is the same instruction sequence, but now the value is bypassed from A0 into the decode stage and we can get going again more quickly. So bypassing is really helping us here, crossed with the superscalar-ness, if you will.

What we mean by "order matters" is that here we've interchanged the last two instructions. We just flipped them, and we turned what was a read-after-write hazard into a write-after-read hazard. Because of that, the sequence actually pulls in by one cycle and we don't get the stall.
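The hazard classification that makes this reordering trick work can be sketched as a small checker. The register numbers below are illustrative stand-ins for the example's write-to-r5, read-from-r5 pair:

```python
# Sketch: classify the dependence between two adjacent instructions,
# each given as a (dest, src1, src2) register tuple.
def classify(first, second):
    """Return 'RAW', 'WAR', or None for two in-order instructions."""
    d1, s1a, s1b = first
    d2, s2a, s2b = second
    if d1 in (s2a, s2b):
        return "RAW"   # true dependence: stall (or bypass) required
    if d2 in (s1a, s1b):
        return "WAR"   # name dependence only: no stall in an in-order pipe
    return None

writer = ("r5", "r1", "r2")   # writes r5
reader = ("r6", "r5", "r3")   # reads r5
print(classify(writer, reader))  # RAW: must stall without bypassing
print(classify(reader, writer))  # WAR: flipping the pair removes the stall
```

The write-after-read case is harmless here precisely because the pipeline issues in order: the read happens in program order before the write ever commits.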
So just by changing the ordering of the instructions, we change the data dependencies, and that actually changes the execution length. Does that make sense to everybody? Why interchanging two instructions completely changes the data dependencies, so that we have to worry about very different data hazards?

Okay, I want to briefly wrap up with fetch logic and alignment. Someone was alluding to this earlier, I think. Let's look at some code that takes jumps. This column is the address, this is the instruction. We execute some instructions, and we have a jump here to address 100 hexadecimal. Then we execute one instruction, OP E, and jump to 204 hexadecimal; then we execute one instruction and jump to 30C hexadecimal, and then we just execute some straight-line stuff. Here is our cache, and let's say the cache block size is four instructions long. We're going to look at how many cycles this takes to execute.

In the first case, let's say there are no alignment constraints. So in cycle zero, we fetch these two instructions from the instruction cache and execute them, and they're aligned nicely together.
There's nothing weird going on; we just pull them out. Okay, the next two instructions, at 8 and C, are next to each other; that's great. Then we jump somewhere else, to 100, and we execute two instructions that are next to each other at the beginning of their line, so that's great, no problem there.

Hm. Okay. Now we start to get some weird stuff. Now we start to jump into the middle of a cache line. In this example we jump to address 204. Our block size is four instructions, and we're jumping to something other than the first instruction in that block. With a fully fleshed-out fetch unit, let's say, you can fetch with any alignment. Then life is easy: we can just fetch and execute those two instructions in the same cycle; in cycle three, we fetch both of them. Hm, that could get harder if we put some realistic constraints on it.

Okay, now let's jump to 30C, near the end of a cache block, and try to fetch these two instructions at the same time. One is on this cache line, and one is on that cache line. Do we need to fetch two things from our cache at the same time? Yeah, we do.
That is, if we actually want to execute this instruction and that instruction at the same time. Let's say for right now that the fetch logic allows us to do that; somehow it's a dual-ported instruction cache, we'll say. And then finally, the op at 314 executes last, and it just falls through; there are no jumps or anything happening.

So some things that can be really hard to make work out right are fetching across cache lines, and possibly even fetching from arbitrary points inside a cache line, depending on your fetch unit logic. And like I said, we might need extra ports on the cache. Here is this code executing, and as you can see we don't actually introduce any stalls: it just executes this, then this, then this, and we execute two instructions every single cycle.

Now let's look at what happens with alignment constraints. Here's our original example, and let's look at what we could possibly try to execute. We jump into the middle of a line, and we only use these two instructions from the middle of that line. So let's say we can only fetch an aligned half of a block, or something like that, in each cycle, because that's how wide our cache port is.
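The "do we need two cache ports?" question comes down to whether a fetch group crosses a block boundary. Here is a minimal sketch of that check, assuming (as in the example) blocks of four 4-byte instructions; the addresses are the ones from the jump example:

```python
# Sketch: detect when a 2-instruction fetch group straddles a cache block
# boundary, which would need a second cache port (or an extra cycle).
BLOCK_BYTES = 4 * 4   # assumed: 4 instructions of 4 bytes per block

def straddles(pc: int, width: int = 2, isize: int = 4) -> bool:
    """True when [pc, pc + width*isize) spans two cache blocks."""
    first_block = pc // BLOCK_BYTES
    last_block = (pc + (width - 1) * isize) // BLOCK_BYTES
    return first_block != last_block

print(straddles(0x100))  # False: 0x100 and 0x104 share a block
print(straddles(0x204))  # False: middle of a block, but still one block
print(straddles(0x30C))  # True: 0x30C and 0x310 are in different blocks
```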
What you might have to do in some architectures, if you have alignment issues like that and you're not allowed to straddle, is fetch extra data that you're just never going to use. You're just throwing away that bandwidth. And the cycle counts change, too.

So let's look at this same code sequence and what happens when we go to execute it. Going back to the beginning: we execute OP A and OP B and just go down the pipe. Okay, life is good. We get to address 8 hexadecimal; well, we're going to swap that pair, because the jump needs to go down pipe A, but otherwise things are okay. Now we jump to the middle of a line. Hm, that starts to get more interesting: we're basically going to end up wasting cycles. This will take seven cycles, where before we had it taking only five, because we've effectively introduced dead cycles where we fetched instructions we just didn't use. The three X's here show up as instructions we fetched; for instance, the instruction at address 200 is one of them: we fetched it and we're not using it.
And we fetched these two as well and didn't use either of them. So having a fetch unit that isn't fully alignment-capable, one that can't fetch across arbitrary boundaries, can cause some serious problems for performance. Let's stop here for today, and we'll talk about the rest next time.
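The two fetch units can be compared with a small simulation. The trace below is my reconstruction of the example's addresses, so the exact counts depend on that reconstruction: with these eleven instructions the aligned-only unit takes seven cycles, matching the lecture's seven, while the unconstrained unit takes six here rather than five because of the odd final instruction.

```python
# Sketch: fetch cycle counts for an alignment-free unit versus one that
# can only fetch an aligned pair each cycle. Assumes 4-byte instructions
# and sequential fall-through within each aligned pair.
def cycles_unconstrained(trace, width=2):
    """Any two consecutive trace entries fetch together, even straddling."""
    return (len(trace) + width - 1) // width

def cycles_aligned(trace, width=2, isize=4):
    """Each cycle fetches the aligned pair holding the next needed PC;
    pair slots we jump over or out of are wasted."""
    cycles, i = 0, 0
    pair_bytes = width * isize
    while i < len(trace):
        base = trace[i] - (trace[i] % pair_bytes)
        while i < len(trace) and base <= trace[i] < base + pair_bytes:
            i += 1            # consume the useful part of this pair
        cycles += 1
    return cycles

# Reconstructed trace: fall-through at 0..C, then jumps to 0x100, 0x204,
# 0x30C, then fall-through to 0x314.
trace = [0x0, 0x4, 0x8, 0xC, 0x100, 0x104, 0x204, 0x208, 0x30C, 0x310, 0x314]
print(cycles_unconstrained(trace))  # 6
print(cycles_aligned(trace))        # 7: dead slots at 0x200, 0x20C, 0x308
```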