Okay, so: very long instruction word processors. We still want high performance, but we took all the hardware that was in an out-of-order superscalar and threw it out. Well, something has to make up for it. And what makes up for it is a very, very smart compiler. So we put a lot of emphasis on the compiler in these sorts of architectures. The compiler really has to do the scheduling. It has to do all of the dependency checking. It has to avoid all the different data hazards, and we're just getting started: next lecture we're going to talk mostly about all the different optimizations a compiler can do to try to approximate what an out-of-order superscalar does, but statically. It's a pretty cool trick: you take all that hardware, you put it in the compiler, and you run it once. Then, every time you go to execute the code, you don't have to recalculate all the dependencies. Sounds good. Okay.

So let's see how we execute some code here, and what the performance aspects are of executing loop code on a very long instruction word processor. Here we have a very basic array increment. We're going to take every element of this array and increment it by the value C. We run it through our compiler, and here's the sequential code sequence; this has not been scheduled yet for our VLIW architecture over here. So: we load the value, we increment our counters, we do the floating-point add, we store the value back, we increment the array index, and then we loop. Seems simple enough.
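For reference, here is a minimal C sketch of the loop being discussed; the names x, c, and N are illustrative, not taken from the slide.

    /* Add a constant to every element of an array: one load, one
       floating-point add, and one store per element, plus loop overhead. */
    void add_const(float *x, float c, long N) {
        for (long i = 0; i < N; i++)
            x[i] = x[i] + c;
    }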
Let's see how this gets scheduled. Because the compiler knows about the latencies in this architecture (let's say this load has a few cycles of latency), it's going to schedule the add later. And the add is a floating-point add, which has a couple of cycles of latency of its own, so the compiler schedules the consumer of that result, the instruction that uses F2, later still. The array-index and counter additions you can sprinkle in more or less anywhere, since the schedule has lots of open slots; let's say they're scheduled at the same time as the load and the store. So this is pretty cool: we're actually executing two instructions wide in parallel here, and we didn't need all the extra overhead of an out-of-order superscalar. Oh, and we can just put the branch somewhere.

Okay. First question: how many floating-point operations per cycle are we doing? Are we doing well here? It looks pretty poor. We have one floating-point operation, just this add, and we have one, two, three, four, five, six, seven, eight cycles. So we're getting 0.125 floating-point operations per cycle. That's not great. We are executing three instructions on our best cycle, which is better than nothing, but we're not really using the machine very well. An out-of-order superscalar would probably take instructions from the iterations below this one, intermix them, reorder a bunch of things, and run faster.

So what's a technique to go faster? As I said, we put a lot of emphasis on compilers in this unit, and one of the things the compiler can do is unroll the loop. So here we have our loop; we unroll it four times, and now we take the loop overhead and factor it out, so that it only happens once every four iterations. That sounds good: we do more work per trip around the loop. But things are a little more complicated. What happens if N, the uppercase N that is our terminating value, is not a multiple of four? Well, we need to do something about that. Before we execute the unrolled body, we need to check that enough iterations remain (if N is big, we can run the unrolled body for a long time), and on the last iteration we'll have to clean up. So we need to generate some cleanup code, and the compiler is responsible for doing this. These compiler optimizations do take some effort.
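Here is a sketch of what unrolling by four looks like at the source level, including the cleanup loop the compiler must generate when N is not a multiple of four (same illustrative names as before):

    void add_const_unrolled(float *x, float c, long N) {
        long i = 0;
        /* Main unrolled loop: four elements per trip, so the loop
           overhead (index update and branch) is paid once per four adds. */
        for (; i + 3 < N; i += 4) {
            x[i]     = x[i]     + c;
            x[i + 1] = x[i + 1] + c;
            x[i + 2] = x[i + 2] + c;
            x[i + 3] = x[i + 3] + c;
        }
        /* Cleanup code for the final N % 4 elements. */
        for (; i < N; i++)
            x[i] = x[i] + c;
    }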
Okay, so let's look at the scheduling of our loop-unrolled code. We can do a bunch of loads up front: we've intermingled these loop iterations, and what's kind of cool is that the loads get pulled out and thrown to the top, the stores get pushed down to the bottom, all the adds go somewhere in the middle, and the index updates get sprinkled in wherever they fit. When we go to actually schedule that, we do something similar: the loads issue first, then we execute the floating-point adds, then the floating-point stores of the results. And what you'll notice here is that we're starting to get some overlap. Because we've unrolled, we can overlap this load with the first floating-point addition, since we've effectively covered the latency of our functional units by filling that time with other loop iterations. If you compare this schedule with the schedule back here, we've simply taken those dead cycles and put other loop iterations into them.

In this loop-unrolled case, we're not incrementing the counters and indexes by four anymore. We're incrementing by the unroll factor times the element size, so we're incrementing by sixteen now. Does that make sense? In the earlier code we incremented R2 by four, because a single value is four bytes, so we had to move our array index over by four. But now, because we're batching up all this work together, we have to move the index by a bigger value: four, because we've unrolled four times, times the size of the data value, which is four. So we move it by sixteen. And one of the nice things here is that in both the loads and the stores, we're using a register-plus-offset addressing mode to fold in the offsets. So we might offset by twelve from the base register R1 to figure out where we're actually loading from. It's just a convenient way to avoid computing a bunch of addresses.
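The addressing trick can be sketched in C as well, assuming 4-byte floats: one base pointer is bumped by sixteen bytes per trip, and the four accesses use the fixed offsets 0, 4, 8, and 12 from that base, mirroring register-plus-offset addressing.

    void add_const_offsets(float *x, float c, long N) {
        char *p   = (char *)x;                  /* base register, counted in bytes */
        char *end = (char *)(x + (N & ~3L));    /* stop at a multiple of four      */
        for (; p < end; p += 16) {              /* one index update per 4 elements */
            *(float *)(p + 0)  += c;            /* offset 0 from the base          */
            *(float *)(p + 4)  += c;            /* offset 4                        */
            *(float *)(p + 8)  += c;            /* offset 8                        */
            *(float *)(p + 12) += c;            /* offset 12                       */
        }
        /* (Cleanup for the final N % 4 elements omitted, as before.) */
    }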
Okay, so going back here, we can see we're starting to overlap actual operations with other loop iterations. That's really cool; we're starting to get some performance. So let's look at the performance, and I'll ask the same question: how many floating-point operations per cycle? Hopefully it's higher. One, two, three, four operations, divided by one, two, three, four, five, six, seven, eight, nine, ten, eleven cycles. Okay, so that's 0.36, which is a lot better than 0.125. This is good; loop unrolling is helping us. But is this everything, or could we do more in our compiler? Well, the compiler people came up with an even fancier idea, called software pipelining.

It uses a term we've seen before, pipelining, but does it in software. The idea is that instead of just having one loop, unrolling it, and overlapping the iterations inside the unrolled body, we're going to take multiple copies of that schedule and interleave them, to try to fill some of the empty slots. So let's look at this. We have the code unrolled four times, the same piece of code from the previous slide, and we draw the schedule we had before, shown here in purple. Okay, that doesn't look so bad. But now we schedule it together with another iteration of the exact same four-times-unrolled loop, shown here in green. So we've just overlapped this with another iteration of the loop. Are we done? Not quite; we still have some open spots. So let's overlap yet another iteration, shown here in red.

Now, the fix-up code you need to make this correct gets more complex, because all of a sudden you're overlapping multiple iterations. But as long as you don't speculatively modify some value, as long as you don't do a speculative store, you're probably okay, because you're only doing extra loads, extra work. You're doing extra work and filling slots, betting that nothing goes wrong, that the trip count N, if you will, is a large multiple of four and that you're not at the end.

So let's put some names to these things. We call the run-up at the beginning the prologue. Here in the middle we have the actual steady-state iterations; you can see (sorry, this is in green, which doesn't show up very well) that there are instructions here, there are adds, and it's pretty full. We're doing a lot of work on our machine here. And the epilogue is when we're done, when we're falling out on the last iteration of the outer loop, if you will.

So let's do some math and look at the performance of this. Same question: how many floating-point operations per cycle? We look over here: we have one, two, three, four operations, and we have four cycles in our tight steady-state loop. That's one floating-point operation per cycle.
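Here is a rough source-level sketch of software pipelining, ignoring the unrolling for clarity and assuming N >= 3: the steady-state body stores iteration i, adds for iteration i+1, and loads for iteration i+2, so the load and add latencies are hidden by work from neighboring iterations.

    void add_const_swp(float *x, float c, long N) {
        /* Prologue: start iterations 0 and 1 before entering steady state. */
        float v0 = x[0] + c;       /* iteration 0: loaded and added */
        float v1 = x[1];           /* iteration 1: loaded only      */
        for (long i = 0; i < N - 2; i++) {
            x[i] = v0;             /* store for iteration i         */
            v0 = v1 + c;           /* add for iteration i + 1       */
            v1 = x[i + 2];         /* load for iteration i + 2      */
        }
        /* Epilogue: drain the last two in-flight iterations. */
        x[N - 2] = v0;
        x[N - 1] = v1 + c;
    }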
That looks pretty good. That's cool; we just got a bunch of performance. But we had to do a lot of compiler optimization to make this work: to fill the machine, we overlapped three different iterations of a loop that we had also already transformed in software by unrolling it four times. So this is called software pipelining.

And there's a nice picture here to show what's going on visually. We put time on the horizontal axis, and activity, roughly how many instructions are executing, on the vertical axis, shown here as performance. When you just run the loop-unrolled iterations back to back, each one has a start-up ramp, a stretch where you're actually in the loop, and then a ramp down. That's worse than paying the start-up and wind-down once, with a long, full loop-iteration portion in the middle. When we look at the software-pipelined version, we can overlap one iteration's wind-down with the next iteration's start-up: we execute our prologue once, execute many iterations back to back in the tight steady state, and then run our epilogue once at the end. So software pipelining pays its start-up and wind-down costs once per execution of the loop, not once per iteration. So that's fun. That's cool. We're getting performance.

If only the world were dense loops, life would be easy. Alas, the world is not all loops. If we had a processor that only did array calculations, and all the problems in the world were dense array computations, life would be really easy. But they're not. A lot of the time, code has lots of branches; it has if-then-else clauses. Here we graphically show something like an if-then-else: some piece of code makes a decision and executes the code on the left or the code on the right, based on an if statement. So this is the if-true clause, and this is the else clause.
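In C, the shape in question is just an ordinary conditional whose direction depends on a loaded value; the names here are made up for illustration.

    /* Which side executes depends on the data in xi, so a static
       scheduler cannot know the branch direction at compile time. */
    float select(float xi, float threshold, float a, float b) {
        if (xi > threshold)
            return xi * a;     /* then-clause */
        else
            return xi + b;     /* else-clause */
    }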
Data-dependent branches like these are typically a problem for very long instruction word processors. Now, why is that? Hm. Well, in an out-of-order processor, you can try to execute code around the branches, moving instructions above the branch and below the branch. But if you're doing static scheduling, when you hit this branch, and let's say it's a hard-to-predict branch, you can't really do anything, because you've packed a bunch of instructions next to each other and they need to execute atomically. So it's hard to move code up and down across that branch. A superscalar can do that, because it has its instruction window and a bunch of hardware techniques that let it. But in our VLIW processor, that's a problem.

So I want to introduce an important piece of compiler nomenclature for this class: the basic block. What is a basic block? A basic block is a piece of code which has a single entry and a single exit. So this is a basic block: it has one entry and one exit. Why is single entry important? If you can jump into the middle of a piece of code, the compiler cannot necessarily reorder the instructions inside that block. And if you have multiple exits, let's say you can exit here, the compiler can't push instructions below that exit point. But if you have a basic block, the compiler knows that this instruction sequence is going to execute effectively atomically (not actually atomically; other things can be going on inside the machine), so from the compiler's perspective it can reorder the instructions within the block to get better performance.
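A small illustration of how code breaks into basic blocks, with the block boundaries marked in comments (the function itself is made up):

    int f(int a, int b) {
        int t = a * 2;        /* block B0: entry, ends at the branch    */
        if (t > b)
            t = t - b;        /* block B1: single entry, single exit    */
        else
            t = t + b;        /* block B2: single entry, single exit    */
        return t;             /* block B3: the join point after the if  */
    }

Within any one of B0 through B3 the compiler may reorder freely; across the branch and the join point it cannot, at least not without the tricks that follow.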
211 00:18:17,412 --> 00:18:22,889 So, let's say profiling, this is not a something hardware is doing at run time, 212 00:18:22,889 --> 00:18:27,421 this is something that you do with the program while you are sort of still back 213 00:18:27,421 --> 00:18:31,443 in the compiler stage. You take the program, the compiler goes 214 00:18:31,443 --> 00:18:35,895 and runs it on some input given, given input set and comes up with the 215 00:18:35,895 --> 00:18:41,300 probabilities of which way things go. And then what you do is, you come up, you 216 00:18:41,300 --> 00:18:47,253 take this profile information and you come up with some guess at what is the most 217 00:18:47,253 --> 00:18:51,501 probable one. And we're going to circle that here. 218 00:18:51,501 --> 00:18:56,084 And say. These darkened edges are the most probable 219 00:18:56,084 --> 00:19:01,070 sort of path through the squirrelly piece of code given this is the entry point. 220 00:19:03,007 --> 00:19:09,035 Now, this doesn't mean that you can't have branches that sort of branch out of this, 221 00:19:09,035 --> 00:19:14,612 but if you do you need to have some fix-up code 'cause what we're about to do is 222 00:19:14,612 --> 00:19:18,082 we're gonna take this entire sort of, big [inaudible] of code here. 223 00:19:18,082 --> 00:19:25,082 We're gonna remove all of the branches and we're gonna schedule for our VLIW 224 00:19:25,082 --> 00:19:29,054 processor as one big monolithic piece of chunk code. 225 00:19:29,054 --> 00:19:34,033 And by doing this we can move instructions, let's say, that are down 226 00:19:34,033 --> 00:19:39,731 here, which this opens last to executing your early portion of this codes sequence, 227 00:19:39,731 --> 00:19:43,908 we can move them up. And likewise we can move things that use 228 00:19:43,908 --> 00:19:50,150 the resolve of long latency instructions up here and push it down across branches. 229 00:19:50,150 --> 00:19:56,586 So our out-of-order superscalar does this with branch speculation, but our compiler 230 00:19:56,586 --> 00:20:00,431 can do this on our VLIW processor using trace schedule. 231 00:20:00,431 --> 00:20:05,968 But when do this, which be careful because there's always a possibility that while 232 00:20:05,968 --> 00:20:09,038 unlikely you can still branch the other way. 233 00:20:09,038 --> 00:20:15,936 So typically, the way this is done is you have some form of fix-up code that you 234 00:20:15,936 --> 00:20:21,640 branch away, you have to sort of fix up anything that was after the branch that 235 00:20:21,640 --> 00:20:26,408 made a committed change if you will, to the, the processor state. 236 00:20:26,408 --> 00:20:31,475 And you sort of roll that back somehow. So, we're basically in software, doing the 237 00:20:31,475 --> 00:20:36,527 rollback case from number our out-of-order superscalar. 238 00:20:36,527 --> 00:20:41,044 So instead of taking the, architectural register file and copying it to the 239 00:20:41,044 --> 00:20:46,625 physical register file on branch mispredict, instead our compiler generates 240 00:20:46,625 --> 00:20:51,634 a code sequence which does that same operation if you were to branch away 241 00:20:51,634 --> 00:20:54,976 there. And we'll roll back, only the only the 242 00:20:54,976 --> 00:20:59,150 certain register that needs to be rolled back and only the memory state that needs 243 00:20:59,150 --> 00:21:01,757 to be rolled back. 
So that's pretty cool: we can basically take all the functionality that was done in our out-of-order superscalar and put it in software, using trace scheduling.