Okay. So now we get to: what can software do for my cache? What can software do for my memory system? Well, surprisingly, a lot. And compiler people actually do spend a fair amount of time thinking about this.

So one of the things you'll actually see is that the compiler will many times pad out data structures. If you disassemble a piece of C code that you have, you'll see that the compiler will often insert extra bytes, which wastes memory, but it's there to improve performance. And how this improves performance is that it makes the access patterns, for instance, either not conflict in our cache, or conflict in our cache in a strange way that happens to be beneficial, or somehow exploit either temporal or spatial locality. So you can do lots of tricks here to improve either spatial or temporal locality.

One of the things I do want to talk about is a sort of hybrid software/hardware solution here: you can actually add instructions to your processor that allow the compiler to give hints to the architecture about whether to pull something into the cache or not. Many times the compiler, or the programmer, will know: I need to use this data exactly once. Now, if you know you only need to use that piece of data once, should you pull it into your cache?

No. You're never going to reuse it. You're never going to get a cache hit; you're just going to take the cache miss, and at some point later in the future you're going to evict it, and it's going to hurt your miss rate.

So instead, there's been the introduction of what people sometimes call non-temporal hints, or no-allocate bits. One way this sometimes works is that certain pages in the memory system are marked as non-temporal and won't be cacheable; you can do this with, let's say, non-cacheable bits in the page tables, for instance. A finer-grain way of doing this is having special load instructions or special store instructions which say: do not ever bring this into my cache, do not pollute my cache with this data, because I never intend to read it again. And lots of architectures have started to include this now as an optimization.
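As a minimal sketch of that finer-grain flavor, assuming an x86 target with SSE2: the _mm_stream_si32 intrinsic is the real non-temporal store (it compiles to MOVNTI), while the function around it and its parameters are made up for illustration. Other architectures expose the same hint under different names.

```c
#include <immintrin.h>
#include <stddef.h>

/* Write out an array we will never read again, using non-temporal
 * (streaming) stores so the destination lines are never allocated
 * in the cache and nothing useful gets evicted to make room. */
void write_once(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], src[i]);
    _mm_sfence();   /* make the streaming stores globally visible */
}
```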
Another fun thing you can do: if the compiler, or the software system, or the programmer knows you're never going to use the data again, why let it sit there and rot in your cache? Get it out, get it out as quickly as possible. Now, there are a couple of different ways to do this. One of the instructions you can add to the architecture, taking a software/hardware approach to this, is something that on x86 I think is actually called flush and invalidate. The idea is that you may still have dirty data in that location, and you want to flush it out to main memory, flush it out of the cache, and invalidate the data in the cache. By giving the compiler or the programmer direct control of this, you can effectively kill data that is in the cache: make that line clean, or make it clean and not valid in the cache, so that when you go to bring in new data, you don't have to evict something. It's kind of the inverse of prefetch. In software prefetch, you bring something in early; here, we push something out early. And that can be a pretty big benefit.

One other thing; I don't have a slide for this, but there's a corollary to it. Lots of times architectures will have an instruction that is basically just an invalidate instruction: it'll invalidate the data. The semantics of this are something you have to be a little bit careful with, because this invalidate does not flush the backing data out to main memory. You're changing architectural state with this instruction: if there was dirty data, you're effectively deleting that dirty data. Sometimes people will use an instruction like that instead of a flush and invalidate, if you know you just have scratchpad memory where you don't need the backing data. That way you don't even have to waste the writeback bandwidth to your L2, or your L3, or your main memory. You can just say: this data is now dead, I never want to use it again. So that's typically an invalidate instruction. And invalidate instructions will lots of times take a cache line number instead of an address; sometimes they'll take an address and invalidate the entire cache line.
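To make the flush-and-invalidate flavor concrete, here is a minimal sketch using the real x86 intrinsic _mm_clflush (the CLFLUSH instruction writes the line back to memory if it is dirty and then invalidates it). The function and the hard-coded line size are assumptions for illustration.

```c
#include <immintrin.h>
#include <stddef.h>

#define CACHE_LINE 64   /* typical x86 line size; assumed here */

/* Once we are completely done with buf, push it out of the cache,
 * line by line, so it stops occupying space and forcing evictions. */
void discard_from_cache(const void *buf, size_t bytes)
{
    const char *p = (const char *)buf;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        _mm_clflush(p + off);   /* writeback if dirty, then invalidate */
}
```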
Unfortunately, you need to take the entire cache line, the entire cache block, and throw it out.

And finally, there's the inverse of that instruction, which a couple of architectures have. I know Alpha has it; I think x86 later added it, though I don't know what it's called on x86. On Alpha it's called something like WH64, I think. Basically, if you have some data and you want to write it into your cache, but you don't want to have to pull in the backing data, because that would just add bandwidth out to your main memory system or out to your higher levels of cache, there's a special instruction which will instead just fill the whole entire cache line with zeros, out of nowhere. And what's nice about this is that now that you've filled an entire cache line with zeros, you can go do writes to those zeros and fill in the useful data, and you never had to go fetch the background data from memory. So it saves bandwidth.

So these are all different hardware/software techniques that, together, the compiler can use to optimize performance.

I do want to say one last thing about improving spatial locality and temporal locality. If you ever go disassemble a C program, you'll definitely see structures. And if you run a C program through feedback-driven compilation, you'll see that a lot of times the compiler will actually reorganize the data structures. For instance, if you go look at the careful description of what C says about this, some languages allow you to insert padding or to reorganize some of the fields in a data structure. And by doing that you can get more performance, by effectively reorganizing the data so that all the things you access in a row sit right next to each other; that gets you, for instance, spatial locality.

Okay. So now we get to go to some of the fun software optimization strategies up here. Let's go to the first one. I'm going to give you guys a second to look at this code; look at the top example first, the top piece of code. We have a loop around a loop: for j less than N, for i less than M. Okay. It's a doubly indexed array x here, and we're effectively striding through this array, somehow, multiplying every element in the array by two.
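The code on the slide isn't reproduced in the transcript, but from the description it is presumably something like this (the bounds M and N and the int element type are assumptions):

```c
#define M 1024   /* assumed dimensions; the slide's bounds aren't shown */
#define N 1024
static int x[M][N];

/* Top example: the inner loop varies the FIRST index, so successive
 * accesses x[0][j], x[1][j], x[2][j], ... are N ints apart in memory,
 * landing on a different cache line almost every time. */
void scale_all(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
}
```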
Seems basic enough. So one of the questions that comes up here is: in C, given the default layout of x, it's a doubly indexed array, and we're walking by the higher-order index in the inner loop.

So if we look at what's happening here, this is a 2D array x, and i goes one way and j goes the other way. Now, we're walking through this array by i, so we're going to walk down, walk down, in that pattern. But by default in C, this array gets laid out such that the second index is laid out contiguously, and the higher-order index is laid out non-contiguously. So this would be address zero, this would be address one, address two, address three, four, five, assuming these were all bytes; then, coming back over here, six, seven, eight, nine, ten, eleven. Unfortunately, with the way the top loop is written, we're effectively fighting the spatial locality of the cache lines.

Huh, that's not good. Well, lo and behold, if we do the exact same operation, and our compiler can easily detect this, and we just flip the order in which we stride through the array, we'll instead go across the array and get spatial locality perfectly. So by doing a simple loop interchange, life gets a lot better.

Actually, before we move on, this brings up a good example of packing of data. I have something here which has six bytes in a row: zero, one, two, three, four, five. Now, one optimization that compilers will many times do is pad the structure out, let's say to eight bytes, because it makes the math a lot easier when striding through the array and computing indices. So it's an example of one of the things I showed before. It will sometimes also give better cache performance, because you'll pull in an entire line at a time instead of straddling different cache lines the way you would with a non-naturally-aligned structure like we had before. That may not hurt in this example, where we stride through the entire array, but in other examples it could definitely be harmful, for instance if you're trying to do a lot of work within one row.

Okay. Before we move off of loop interchange: actually, what type of locality does this improve? Spatial. Yay, we drilled that one down pretty well.
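To close out the interchange point, the fixed version of the sketch above would presumably be:

```c
/* Interchanged: the inner loop now varies the SECOND index, so we walk
 * x[i][0], x[i][1], x[i][2], ... which sit next to each other in
 * memory, and every cache line brought in is used completely. */
void scale_all_interchanged(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```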
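And as a sketch of the packing point, here is a six-byte structure alongside an explicitly padded eight-byte version (the names are made up; real compilers and programmers make this trade-off case by case):

```c
/* Six bytes of payload: array elements land at offsets 0, 6, 12, ...,
 * so indexing costs a multiply by 6 and elements can straddle lines. */
struct rec_packed {
    unsigned char bytes[6];
};

/* Padded to eight bytes: indexing becomes a shift by 3, and a 64-byte
 * cache line holds exactly eight whole elements, none straddling. */
struct rec_padded {
    unsigned char bytes[6];
    unsigned char pad[2];   /* wasted space, traded for speed */
};
```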
Okay. So let's look at our next piece of code here. For i less than N, we multiply b[i] times c[i] and deposit it into a[i]. And in the next loop we do something very similar: we take c[i], multiply it by a[i], and deposit it into d[i]. Now, you're going to say: how does this happen? This seems like a brain-dead way to write this piece of code. Yes and no; maybe you do something in between here, between these two for loops. And it may not be readily apparent to everybody that you can actually do anything about this, because we have this piece of code here which goes and reads, for instance, all of a.

Well, a compiler can recognize this and say: we just wrote this entire array, and now we go back and read this entire array, and we read c before, also. That's not very good. Because effectively, let's say it all falls out of the cache, assuming these arrays are large: we're going to read an element of c, read an element of b, and deposit into a. We're basically going to stride through the cache, and these three arrays are all going to kick each other out. By the time we get back to the second loop, a[0] is definitely not going to be in our cache. Maybe the last element of a will be in our cache, a[N-1] will be in our cache, but a[0] is not going to be in our cache.

So the compiler can think real hard about this and do what's called loop fusion: merge the two loops into one. You take b[i] times c[i] and put it here, and then use that. You're probably actually going to keep that in a temporary variable; you're probably not going to go through main memory when the write of a[i] and the read of a[i] are in statements right next to each other, but you still have to write a[i], we'll say, for program correctness. And then you take a[i] and c[i], multiply them by each other, and put the result into d. What's nice about this is that you're going to bring in a cache line of b and a cache line of c, do the multiplication; you already have a[i], because that's your destination and you just have that value; and then you can write over into d[i], and stride through all of those cache lines together. And effectively, you don't have to go re-fetch a[0] multiple times.
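In code, the before and after presumably look like this (the array length and the double element type are assumptions):

```c
#define LEN 100000   /* assumed array length */
double a[LEN], b[LEN], c[LEN], d[LEN];

/* Before: two passes.  With large arrays, a[0] and c[0] have long
 * since been evicted by the time the second loop starts. */
void unfused(void)
{
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] * c[i];
    for (int i = 0; i < LEN; i++)
        d[i] = a[i] * c[i];
}

/* After fusion: one pass.  b[i], c[i], a[i], and d[i] are all touched
 * while their cache lines are resident, and the fresh value lives in
 * a temporary (a register) between the two statements. */
void fused(void)
{
    for (int i = 0; i < LEN; i++) {
        double t = b[i] * c[i];
        a[i] = t;           /* still stored, for program correctness */
        d[i] = t * c[i];    /* reuses t instead of re-reading a[i] */
    }
}
```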
Instead, you've fetched it once. And what's also nice is that you don't have to go re-fetch, let's say, c[0] twice; it's already there in your cache. Okay. So the question comes up: what type of locality does this improve?

Temporal, yes. Because before, we were bringing something into the cache, kicking it out of the cache, and bringing it back into the cache. Both of these loops actually already exhibit decent spatial locality, and we haven't changed that; they both stride through the arrays spatially. But temporally, we no longer have to kick a and c out of the cache and bring them in twice. So we're able to exploit a bunch of spatial... oh, sorry. [LAUGH] We're able to exploit a lot of temporal locality here.

Okay, so now we get to something a little bit more sophisticated: matrix multiply. A common thing for computers to do, at least in dense matrix codes or linear algebra sorts of codes: you take matrix a, and you want to multiply it by matrix b. This is like traditional MATLAB kind of code: you have a big array, and you want to multiply it by another array. This applies to higher-dimensional matrices also, but it's easy to draw two-dimensional things on a two-dimensional screen and hard to draw four-dimensional things on a two-dimensional screen, so we'll start with that.

Here's our naive implementation. We have three for loops going on here. And what we're doing is the traditional way you learn to do matrix multiplication: you take a row and multiply it by a column, and that's that spot. Take this row, multiply by that column, that's that result. This row by that column, that's that spot; this row by that column, there; and you just work through it. Let's step through what the multiplication actually is: you take this value, multiply by that value; take this value, multiply by that value; and you keep going down, summing each of the multiplication results. And that's how we ultimately end up with this value here. So this is an input, this is an input, and this is the output.
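The slide's code isn't in the transcript, but the naive triple loop presumably looks like this (square matrices of doubles are an assumption):

```c
#define SZ 512   /* assumed dimension */
double A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];

/* Naive multiply: row i of A times column j of B.  The column walk
 * B[0][j], B[1][j], ... touches a different cache line on every
 * element, which is exactly what gets picked apart next. */
void matmul_naive(void)
{
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            double sum = 0.0;
            for (int k = 0; k < SZ; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```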
Okay, let's think about what this does to our cache. Okay. Let's say the elements here are bigger; they're not just one byte or one word, they're a couple of words, maybe a long long or something like that, some big value. And let's say only three values fit in a cache block, a cache line.

So we go to do this multiply, and we basically have to bring in two lines for this one row. And then over here, if the cache lines are arranged this way, as we go down this column we're bringing in a different cache line for each element and only using one value out of that cache line. Now, that's not super great. To make matters worse, let's go look at the output array. The output array looks relatively similar to the input array: we're writing here, writing here, writing here, so we're just kind of streaming through memory.

Could we do better? It's a good question. At least, the way this code runs, we're going from left to right, so we are actually taking advantage of spatial locality; we could be doing worse. But just to recap: these are pretty poor cache accesses. We're bringing in many cache lines, and they're all going to fight each other, especially for large arrays. And over here, we're pulling in line after line. Though one of the interesting things is, after we do this multiplication by that column, we're going to do the multiplication by the next column, which sounds good: this times this, this times this. We should be getting temporal locality there. But the downside is that for very large matrices, as I said, a column can be multiple blocks, multiple lines, long. What's going to end up happening is that this may conflict with something else farther down, even within its own array, and you're just increasing the occupancy of the cache. So we're going to talk about blocking, which will make it so we only have to read one block at a time and do the multiplication in multiple steps.

Okay. So the code gets a little more complicated when we go to blocking our cache, blocking our matrix multiply. Sometimes people call this tiling instead of blocking.
Sort of; I think tiling is traditionally the matrix-multiplication-specific term for the more general thing, which we call blocking. The code gets a little more complicated. But what we're going to do here is start off by multiplying these three elements by these top three elements and depositing that into the result. Now, you might say to yourself: that's not the result that needs to go there. It's not; in fact, it's only half of the result that needs to go there. So you need to do something: you need to finish this, multiply this by that, and add what was there before to the new value. But the nice thing is that addition is commutative, so we can re-order these steps. In fact, what we're going to do is block, or tile: take this block, rotate it, multiply it with that block, and deposit the results here. And then in our next step, you see, we're going to sum into the same destination.

So the first step is: we take this and multiply it by that. And what we see is that we have one, two, three cache lines here, and over here we have one, two, three cache lines. Even though we're accessing them this way, we've designed the algorithm such that we reuse the other portion of the data that we pull in. Then we do that multiply and sum the partial products into these nine locations. And then we do the second half of it: we take this, multiply by that, and accumulate into there. As you'll see, this is going to have better locality, which we'll talk about in a second, and we have to keep fewer cache lines in the cache at the same time while still getting good performance.

Okay. So what if you want to compute this part over here? In this example we were only computing that one part of the result array, so how do we compute the part over there? Well, it's the same idea: you take this times that, deposit it there, then take this, rotate it, multiply it by that, and accumulate into there. A sketch of what the blocked loop nest looks like is below.
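This is a sketch of the blocked loop nest over the same A, B, and C as the naive version; the 3-by-3 tile mirrors the diagrams, while real code would pick a tile size so that three tiles fit comfortably in the cache.

```c
#define T 3   /* tile size; 3x3 to match the diagrams (an assumption) */

/* Blocked (tiled) multiply.  Assumes C starts zeroed (file-scope
 * arrays are).  Each kk step accumulates a partial product into the
 * same TxT destination tile, which is legal because the additions
 * can be re-ordered, exactly as argued above. */
void matmul_blocked(void)
{
    for (int ii = 0; ii < SZ; ii += T)
        for (int jj = 0; jj < SZ; jj += T)
            for (int kk = 0; kk < SZ; kk += T)
                for (int i = ii; i < ii + T && i < SZ; i++)
                    for (int j = jj; j < jj + T && j < SZ; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + T && k < SZ; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```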
So, to recap, a couple of happy things about this. We're really able to shrink the working set in our cache, so our effective cache load is lower: the number of cache lines we need to keep in our cache is quite a bit lower. That's temporal locality, because we're basically going to have the data in our cache and keep working on the data in our cache. We also end up, if you look at the input array here, with better spatial locality, because before we were streaming through all these different cache lines, and now we can work on one block at a time. And especially in our result matrix, because this step and this step reuse those entries in the result matrix twice. So we get some benefit out of that. So the answer is that you get both temporal and spatial locality improvements here. It's a fun software optimization you can do to improve your performance.

Okay. So let's look at compiler memory optimizations and what we can get out of them. Well, you can get some different things out of this. If you do the cool tricks where you have non-temporal hints, to some extent you can use that to increase your bandwidth, because you're utilizing less of it. Hit time: I don't think software really changes the hit time; that's a hardware constraint, so we're not going to be able to do anything about that. Miss penalty we might be able to help out with. So how does the compiler help with miss penalty? Well, it can arrange the data so that it's closer; for instance, the blocking algorithm will use a smaller cache footprint, so the working set fits into a smaller cache and we won't have to go out to the L2. A good example of this: let's say you have a matrix multiply. With the naive approach, it fits in your L2 but doesn't fit in the L1.