Okay. So now we get to: what can software do for my cache? What can software do for my memory system? Well, surprisingly, a lot. And compiler people actually do spend a fair amount of time thinking about this.

So one of the things you'll actually see is that the compiler will many times pad out data structures. If you disassemble a piece of C code that you have, you'll see that the compiler will often insert extra bytes, which wastes memory, but it's there to improve performance. And how this improves performance is that it makes the access patterns, for instance, either not conflict in our cache, or conflict in our cache in a strange way that happens to be beneficial, or somehow exploit either temporal or spatial locality. So you can do lots of tricks here to improve either spatial or temporal locality.

One of the things I do want to talk about is a sort of hybrid software/hardware solution here: you can actually add instructions to your processor that allow the compiler to give hints to the architecture about whether to pull something into the cache or not. Many times the compiler, or the programmer, will know: I need to use this data exactly once. Now, if you know you only need to use that piece of data once, should you pull it into your cache?

No. You're never going to reuse it. You're never going to get a cache hit; you're just going to take the cache miss, and at some point later in the future you're going to evict it, and it's going to hurt your miss rate.

So instead, there's been the introduction of what people sometimes call non-temporal hints, or no-allocate bits. One way this sometimes works is that certain pages in the memory system are marked as non-temporal and won't be cacheable; you can do this with, let's say, non-cacheable bits in the page tables, for instance. A finer-grain way of doing this is having special load instructions or special store instructions which say: do not ever bring this into my cache, do not pollute my cache with this data, because I never intend to read it again. And lots of architectures have started to include this now as an optimization.
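As a minimal sketch of that finer-grain flavor, assuming an x86 target with SSE2: the _mm_stream_si32 intrinsic is the real non-temporal store (it compiles to MOVNTI), while the function around it and its parameters are made up for illustration. Other architectures expose the same hint under different names.

```c
#include <immintrin.h>
#include <stddef.h>

/* Write out an array we will never read again, using non-temporal
 * (streaming) stores so the destination lines are never allocated
 * in the cache and nothing useful gets evicted to make room. */
void write_once(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], src[i]);
    _mm_sfence();   /* make the streaming stores globally visible */
}
```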
Another fun thing you can do: if the compiler, or the software system, or the programmer knows you're never going to use the data again, why let it sit there and rot in your cache? Get it out, get it out as quickly as possible. Now, there are a couple of different ways to do this. One of the instructions you can add to the architecture, taking a software/hardware approach to this, is something that on x86 I think is actually called flush and invalidate. The idea is that you may still have dirty data in that location, and you want to flush it out to main memory, flush it out of the cache, and invalidate the data in the cache. By giving the compiler or the programmer direct control of this, you can effectively kill data that is in the cache: make that line clean, or make it clean and not valid in the cache, so that when you go to bring in new data, you don't have to evict something. It's kind of the inverse of prefetch. In software prefetch, you bring something in early; here, we push something out early. And that can be a pretty big benefit.

One other thing; I don't have a slide for this, but there's a corollary to it. Lots of times architectures will have an instruction that is basically just an invalidate instruction: it'll invalidate the data. The semantics of this are something you have to be a little bit careful with, because this invalidate does not flush the backing data out to main memory. You're changing architectural state with this instruction: if there was dirty data, you're effectively deleting that dirty data. Sometimes people will use an instruction like that instead of a flush and invalidate, if you know you just have scratchpad memory where you don't need the backing data. That way you don't even have to waste the writeback bandwidth to your L2, or your L3, or your main memory. You can just say: this data is now dead, I never want to use it again. So that's typically an invalidate instruction. And invalidate instructions will lots of times take a cache line number instead of an address; sometimes they'll take an address and invalidate the entire cache line.
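To make the flush-and-invalidate flavor concrete, here is a minimal sketch using the real x86 intrinsic _mm_clflush (the CLFLUSH instruction writes the line back to memory if it is dirty and then invalidates it). The function and the hard-coded line size are assumptions for illustration.

```c
#include <immintrin.h>
#include <stddef.h>

#define CACHE_LINE 64   /* typical x86 line size; assumed here */

/* Once we are completely done with buf, push it out of the cache,
 * line by line, so it stops occupying space and forcing evictions. */
void discard_from_cache(const void *buf, size_t bytes)
{
    const char *p = (const char *)buf;
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        _mm_clflush(p + off);   /* writeback if dirty, then invalidate */
}
```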
Unfortunately, you need to take the entire cache line, the entire cache block, and throw it out.

And finally, there's the inverse of that instruction, which a couple of architectures have. I know Alpha has it; I think x86 later added it, though I don't know what it's called on x86. On Alpha it's called something like WH64, I think. Basically, if you have some data and you want to write it into your cache, but you don't want to have to pull in the backing data, because that would just add bandwidth out to your main memory system or out to your higher levels of cache, there's a special instruction which will instead just fill the whole entire cache line with zeros, out of nowhere. And what's nice about this is that now that you've filled an entire cache line with zeros, you can go do writes to those zeros and fill in the useful data, and you never had to go fetch the background data from memory. So it saves bandwidth.

So these are all different hardware/software techniques that, together, the compiler can use to optimize performance.

I do want to say one last thing about improving spatial locality and temporal locality. If you ever go disassemble a C program, you'll definitely see structures. And if you run a C program through feedback-driven compilation, you'll see that a lot of times the compiler will actually reorganize the data structures. For instance, if you go look at the careful description of what C says about this, some languages allow you to insert padding or to reorganize some of the fields in a data structure. And by doing that you can get more performance, by effectively reorganizing the data so that all the things you access in a row sit right next to each other; that gets you, for instance, spatial locality.

Okay. So now we get to go to some of the fun software optimization strategies up here. Let's go to the first one. I'm going to give you guys a second to look at this code; look at the top example first, the top piece of code. We have a loop around a loop: for j less than N, for i less than M. Okay. It's a doubly indexed array x here, and we're effectively striding through this array, somehow, multiplying every element in the array by two.
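The code on the slide isn't reproduced in the transcript, but from the description it is presumably something like this (the bounds M and N and the int element type are assumptions):

```c
#define M 1024   /* assumed dimensions; the slide's bounds aren't shown */
#define N 1024
static int x[M][N];

/* Top example: the inner loop varies the FIRST index, so successive
 * accesses x[0][j], x[1][j], x[2][j], ... are N ints apart in memory,
 * landing on a different cache line almost every time. */
void scale_all(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
}
```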
Seems basic enough. So one of the questions that comes up here is: in C, given the default layout of x, it's a doubly indexed array, and we're walking by the higher-order index in the inner loop.

So if we look at what's happening here, this is a 2D array x, and i goes one way and j goes the other way. Now, we're walking through this array by i, so we're going to walk down, walk down, in that pattern. But by default in C, this array gets laid out such that the second index is laid out contiguously, and the higher-order index is laid out non-contiguously. So this would be address zero, this would be address one, address two, address three, four, five, assuming these were all bytes; then, coming back over here, six, seven, eight, nine, ten, eleven. Unfortunately, with the way the top loop is written, we're effectively fighting the spatial locality of the cache lines.

Huh, that's not good. Well, lo and behold, if we do the exact same operation, and our compiler can easily detect this, and we just flip the order in which we stride through the array, we'll instead go across the array and get spatial locality perfectly. So by doing a simple loop interchange, life gets a lot better.

Actually, before we move on, this brings up a good example of packing of data. I have something here which has six bytes in a row: zero, one, two, three, four, five. Now, one optimization that compilers will many times do is pad the structure out, let's say to eight bytes, because it makes the math a lot easier when striding through the array and computing indices. So it's an example of one of the things I showed before. It will sometimes also give better cache performance, because you'll pull in an entire line at a time instead of straddling different cache lines the way you would with a non-naturally-aligned structure like we had before. That may not hurt in this example, where we stride through the entire array, but in other examples it could definitely be harmful, for instance if you're trying to do a lot of work within one row.

Okay. Before we move off of loop interchange: actually, what type of locality does this improve? Spatial. Yay, we drilled that one down pretty well.
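To close out the interchange point, the fixed version of the sketch above would presumably be:

```c
/* Interchanged: the inner loop now varies the SECOND index, so we walk
 * x[i][0], x[i][1], x[i][2], ... which sit next to each other in
 * memory, and every cache line brought in is used completely. */
void scale_all_interchanged(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}
```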
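And as a sketch of the packing point, here is a six-byte structure alongside an explicitly padded eight-byte version (the names are made up; real compilers and programmers make this trade-off case by case):

```c
/* Six bytes of payload: array elements land at offsets 0, 6, 12, ...,
 * so indexing costs a multiply by 6 and elements can straddle lines. */
struct rec_packed {
    unsigned char bytes[6];
};

/* Padded to eight bytes: indexing becomes a shift by 3, and a 64-byte
 * cache line holds exactly eight whole elements, none straddling. */
struct rec_padded {
    unsigned char bytes[6];
    unsigned char pad[2];   /* wasted space, traded for speed */
};
```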
Okay. So let's look at our next piece of code here. For i less than N, we multiply b[i] times c[i] and deposit it into a[i]. And in the next loop we do something very similar: we take c[i], multiply it by a[i], and deposit it into d[i]. Now, you're going to say: how does this happen? This seems like a brain-dead way to write this piece of code. Yes and no; maybe you do something in between here, between these two for loops. And it may not be readily apparent to everybody that you can actually do anything about this, because we have this piece of code here which goes and reads, for instance, all of a.

Well, a compiler can recognize this and say: we just wrote this entire array, and now we go back and read this entire array, and we read c before, also. That's not very good. Because effectively, let's say it all falls out of the cache, assuming these arrays are large: we're going to read an element of c, read an element of b, and deposit into a. We're basically going to stride through the cache, and these three arrays are all going to kick each other out. By the time we get back to the second loop, a[0] is definitely not going to be in our cache. Maybe the last element of a will be in our cache, a[N-1] will be in our cache, but a[0] is not going to be in our cache.

So the compiler can think real hard about this and do what's called loop fusion: merge the two loops into one. You take b[i] times c[i] and put it here, and then use that. You're probably actually going to keep that in a temporary variable; you're probably not going to go through main memory when the write of a[i] and the read of a[i] are in statements right next to each other, but you still have to write a[i], we'll say, for program correctness. And then you take a[i] and c[i], multiply them by each other, and put the result into d. What's nice about this is that you're going to bring in a cache line of b and a cache line of c, do the multiplication; you already have a[i], because that's your destination and you just have that value; and then you can write over into d[i], and stride through all of those cache lines together. And effectively, you don't have to go re-fetch a[0] multiple times.
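In code, the before and after presumably look like this (the array length and the double element type are assumptions):

```c
#define LEN 100000   /* assumed array length */
double a[LEN], b[LEN], c[LEN], d[LEN];

/* Before: two passes.  With large arrays, a[0] and c[0] have long
 * since been evicted by the time the second loop starts. */
void unfused(void)
{
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] * c[i];
    for (int i = 0; i < LEN; i++)
        d[i] = a[i] * c[i];
}

/* After fusion: one pass.  b[i], c[i], a[i], and d[i] are all touched
 * while their cache lines are resident, and the fresh value lives in
 * a temporary (a register) between the two statements. */
void fused(void)
{
    for (int i = 0; i < LEN; i++) {
        double t = b[i] * c[i];
        a[i] = t;           /* still stored, for program correctness */
        d[i] = t * c[i];    /* reuses t instead of re-reading a[i] */
    }
}
```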
Instead, you've fetched it once. And what's also nice is that you don't have to go re-fetch, let's say, c[0] twice; it's already there in your cache. Okay. So the question comes up: what type of locality does this improve?

Temporal, yes. Because before, we were bringing something into the cache, kicking it out of the cache, and bringing it back into the cache. Both of these loops actually already exhibit decent spatial locality, and we haven't changed that; they both stride through the arrays spatially. But temporally, we no longer have to kick a and c out of the cache and bring them in twice. So we're able to exploit a bunch of spatial... oh, sorry. [LAUGH] We're able to exploit a lot of temporal locality here.

Okay, so now we get to something a little bit more sophisticated: matrix multiply. A common thing for computers to do, at least in dense matrix codes or linear algebra sorts of codes: you take matrix a, and you want to multiply it by matrix b. This is like traditional MATLAB kind of code: you have a big array, and you want to multiply it by another array. This applies to higher-dimensional matrices also, but it's easy to draw two-dimensional things on a two-dimensional screen and hard to draw four-dimensional things on a two-dimensional screen, so we'll start with that.

Here's our naive implementation. We have three for loops going on here. And what we're doing is the traditional way you learn to do matrix multiplication: you take a row and multiply it by a column, and that's that spot. Take this row, multiply by that column, that's that result. This row by that column, that's that spot; this row by that column, there; and you just work through it. Let's step through what the multiplication actually is: you take this value, multiply by that value; take this value, multiply by that value; and you keep going down, summing each of the multiplication results. And that's how we ultimately end up with this value here. So this is an input, this is an input, and this is the output.
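The slide's code isn't in the transcript, but the naive triple loop presumably looks like this (square matrices of doubles are an assumption):

```c
#define SZ 512   /* assumed dimension */
double A[SZ][SZ], B[SZ][SZ], C[SZ][SZ];

/* Naive multiply: row i of A times column j of B.  The column walk
 * B[0][j], B[1][j], ... touches a different cache line on every
 * element, which is exactly what gets picked apart next. */
void matmul_naive(void)
{
    for (int i = 0; i < SZ; i++)
        for (int j = 0; j < SZ; j++) {
            double sum = 0.0;
            for (int k = 0; k < SZ; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```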
Okay, let's think about what this does to our cache. Okay. Let's say the elements here are bigger; they're not just one byte or one word, they're a couple of words, maybe a long long or something like that, some big value. And let's say only three values fit in a cache block, a cache line.

So we go to do this multiply, and we basically have to bring in two lines for this one row. And then over here, if the cache lines are arranged this way, as we go down this column we're bringing in a different cache line for each element and only using one value out of that cache line. Now, that's not super great. To make matters worse, let's go look at the output array. The output array looks relatively similar to the input array: we're writing here, writing here, writing here, so we're just kind of streaming through memory.

Could we do better? It's a good question. At least, the way this code runs, we're going from left to right, so we are actually taking advantage of spatial locality; we could be doing worse. But just to recap: these are pretty poor cache accesses. We're bringing in many cache lines, and they're all going to fight each other, especially for large arrays. And over here, we're pulling in line after line. Though one of the interesting things is, after we do this multiplication by that column, we're going to do the multiplication by the next column, which sounds good: this times this, this times this. We should be getting temporal locality there. But the downside is that for very large matrices, as I said, a column can be multiple blocks, multiple lines, long. What's going to end up happening is that this may conflict with something else farther down, even within its own array, and you're just increasing the occupancy of the cache. So we're going to talk about blocking, which will make it so we only have to read one block at a time and do the multiplication in multiple steps.

Okay. So the code gets a little more complicated when we go to blocking our cache, blocking our matrix multiply. Sometimes people call this tiling instead of blocking.
Sort of; I think tiling is traditionally the matrix-multiplication-specific term for the more general thing, which we call blocking. The code gets a little more complicated. But what we're going to do here is start off by multiplying these three elements by these top three elements and depositing that into the result. Now, you might say to yourself: that's not the result that needs to go there. It's not; in fact, it's only half of the result that needs to go there. So you need to do something: you need to finish this, multiply this by that, and add what was there before to the new value. But the nice thing is that addition is commutative, so we can re-order these steps. In fact, what we're going to do is block, or tile: take this block, rotate it, multiply it with that block, and deposit the results here. And then in our next step, you see, we're going to sum into the same destination.

So the first step is: we take this and multiply it by that. And what we see is that we have one, two, three cache lines here, and over here we have one, two, three cache lines. Even though we're accessing them this way, we've designed the algorithm such that we reuse the other portion of the data that we pull in. Then we do that multiply and sum the partial products into these nine locations. And then we do the second half of it: we take this, multiply by that, and accumulate into there. As you'll see, this is going to have better locality, which we'll talk about in a second, and we have to keep fewer cache lines in the cache at the same time while still getting good performance.

Okay. So what if you want to compute this part over here? In this example we were only computing that one part of the result array, so how do we compute the part over there? Well, it's the same idea: you take this times that, deposit it there, then take this, rotate it, multiply it by that, and accumulate into there. A sketch of what the blocked loop nest looks like is below.
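This is a sketch of the blocked loop nest over the same A, B, and C as the naive version; the 3-by-3 tile mirrors the diagrams, while real code would pick a tile size so that three tiles fit comfortably in the cache.

```c
#define T 3   /* tile size; 3x3 to match the diagrams (an assumption) */

/* Blocked (tiled) multiply.  Assumes C starts zeroed (file-scope
 * arrays are).  Each kk step accumulates a partial product into the
 * same TxT destination tile, which is legal because the additions
 * can be re-ordered, exactly as argued above. */
void matmul_blocked(void)
{
    for (int ii = 0; ii < SZ; ii += T)
        for (int jj = 0; jj < SZ; jj += T)
            for (int kk = 0; kk < SZ; kk += T)
                for (int i = ii; i < ii + T && i < SZ; i++)
                    for (int j = jj; j < jj + T && j < SZ; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + T && k < SZ; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```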
So, to recap, a couple of happy things about this. We're really able to shrink the working set in our cache, so our effective cache load is lower: the number of cache lines we need to keep in our cache is quite a bit lower. That's temporal locality, because we're basically going to have the data in our cache and keep working on the data in our cache. We also end up, if you look at the input array here, with better spatial locality, because before we were streaming through all these different cache lines, and now we can work on one block at a time. And especially in our result matrix, because this step and this step reuse those entries in the result matrix twice. So we get some benefit out of that. So the answer is that you get both temporal and spatial locality improvements here. It's a fun software optimization you can do to improve your performance.

Okay. So let's look at compiler memory optimizations and what we can get out of them. Well, you can get some different things out of this. If you do the cool tricks where you have non-temporal hints, to some extent you can use that to increase your bandwidth, because you're utilizing less of it. Hit time: I don't think software really changes the hit time; that's a hardware constraint, so we're not going to be able to do anything about that. Miss penalty we might be able to help out with. So how does the compiler help with miss penalty? Well, it can arrange the data so that it's closer; for instance, the blocking algorithm will use a smaller cache footprint, so the working set fits into a smaller cache and we won't have to go out to the L2. A good example of this: let's say you have a matrix multiply. With the naive approach, it fits in your L2 but doesn't fit in the L1.