So, now that we've talked about these different memory technologies, let's move on to an introduction to caches: why we have caches, and where they really come from, from a first-principles perspective.

To start motivating caches, let's draw the most basic memory system we can draw. In our abstract computer model, we have a processor, and it connects to somewhere that stores our data and our instructions: our main memory. The time it takes to actually access something like main memory can often be much, much larger than our cycle time; in other words, the latency can be much longer. For that DRAM I passed around, to get the first byte out of a modern DRAM, you're looking at something on the order of a thousand instructions' worth of time in a modern-day computer. It has gotten a little bit better, as we'll see on the next graph, for a funny reason.

Bandwidth we're going to define as the number of accesses per unit time. As I said, for the close-in memory arrays you can usually access them faster and get more bandwidth out of them, almost by definition. For things far away, you tend to have less bandwidth, again almost by definition, because the data has to go through pins or buses that travel a long distance.

And then we're going to define the bandwidth-delay product as the amount of data that can be in flight at the same time. As things get farther away and you have more bandwidth, this bandwidth-delay product gets bigger, because you're multiplying the bandwidth you have by the time the data spends in the pipe. You basically have a big pipeline that can have thousands of values in flight.
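To put a rough number on that (the figures here are my own illustrative assumptions, not values from the lecture): with, say, 10 GB/s of memory bandwidth and 100 ns of access latency,

$$
\text{bandwidth-delay product} = 10\ \tfrac{\text{GB}}{\text{s}} \times 100\ \text{ns} = 10^{10}\ \tfrac{\text{B}}{\text{s}} \times 10^{-7}\ \text{s} = 1000\ \text{bytes},
$$

so roughly a thousand bytes' worth of requests have to be in flight at once just to keep a pipe like that full.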
So, this is what I was alluding to, about the gap opening up between DRAM bandwidth and processor performance. Memory technology has been slowly getting faster, and processor technology got a lot faster. The one saving grace is that it's not so bad anymore for single-scalar, single-processor performance, because that has flattened out at this point, so [laugh] DRAM is, relatively speaking, catching up. Unfortunately, the processor curve still keeps going up exponentially if you look at multicore: with multiple cores on the same chip, the aggregate processor performance keeps growing relative to memory performance, and things get worse.

Here we have a four-issue, two-gigahertz superscalar accessing a 100-nanosecond DRAM: it can execute 800 instructions during the time it takes to make that one memory reference (4 instructions per cycle × 2 cycles per nanosecond × 100 nanoseconds = 800 instructions). We have processors faster than two gigahertz now, and we also have slower memories, so depending on what memory system you put the processor in, it can be roughly a thousand instructions.

Okay. So, what's the important first-order effect when we're looking at memory systems? Well, size does matter. As you build larger and larger memory systems, they simply take longer to access. There's more capacity, you have to drive signals physically farther, and you can't fundamentally fight the speed of light: it takes a certain amount of time to go a certain distance, and if that distance is farther, it takes longer. You also have more fan-out in bigger memory systems. So physical size is important when we're thinking about memory systems.

Okay. So, let's now introduce a cache, or the idea of why a cache is good, by looking at the notion of a memory hierarchy. Unlike the earlier picture, where we had a processor connected directly to all of the memory in the system, we're now going to put in some intermediate levels. We have a processor, and we're going to put small, fast memories next to it, and we're hopefully going to hit in those small, fast memories instead of accessing the big, slow DRAM.

Let's look at the capacities here. The registers, your general-purpose, program-visible registers in the processor, have very small capacity. SRAM has more capacity, and DRAM has even more capacity; the drawing is scaled to accentuate that you can fit a lot more data in the same space as you go down the hierarchy. And the latency follows that same hierarchy.
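To make that concrete, here is a minimal sketch, under my own assumptions, of a pointer-chasing microbenchmark in C (this is illustrative code I'm adding, not something from the lecture). Each load depends on the previous one, so the average time per access roughly tracks the latency of whichever level of the hierarchy the working set fits in; as the footprint grows past each level, the time per access jumps.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a single-cycle random permutation of n_elems indices so that every
 * load's address depends on the previous load (no prefetching, no overlap),
 * then report the average nanoseconds per access. */
static double chase_ns(size_t n_elems, long iters)
{
    size_t *next = malloc(n_elems * sizeof *next);
    for (size_t i = 0; i < n_elems; i++)
        next[i] = i;
    /* Sattolo's algorithm: produces one big cycle through all elements. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;        /* j < i */
        size_t t = next[i];
        next[i] = next[j];
        next[j] = t;
    }

    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; k++)
        idx = next[idx];                      /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = idx;               /* keep the loop from being optimized away */
    (void)sink;
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    /* Sweep working sets from 16 KB (fits in a small cache) to 64 MB (spills to DRAM). */
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4) {
        size_t n = kb * 1024 / sizeof(size_t);
        printf("%8zu KB : %6.1f ns per access\n", kb, chase_ns(n, 20000000L));
    }
    return 0;
}
```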
Another important thing to think about when you're building memory hierarchies is that on-chip memory bandwidth is typically a lot higher than off-chip memory bandwidth. On-chip, you can just run wires; but the second you have to go off-chip, you have to go out through the pins of the chip, and you're going to be limited. What this really boils down to is the pins, or balls, or solder columns. Modern chips don't actually have pins on them anymore; instead they have a ball grid array, which is an array of tiny solder balls. The pitch of those balls is still much larger than an on-chip wire: an on-chip wire is much, much smaller than a solder ball. And a really modern part, something like an Intel Core i5, actually has what's called a land grid array, where there's an array of little metal pads on the bottom of the chip, and the pins are actually located in the socket, on what are basically pogo pins: little pieces of metal on springs. You put the chip in, the pogo pin presses against the pad, and that's the contact. On older chips, the pins were on the chip itself and plugged into a big socket, and ball grid array sits in between, historically.

But anyway, all of these connection technologies for getting off the chip are much larger than on-chip wires, so you can fit fewer of them in the same area. And that's really the insight here: on-chip bandwidth is a lot larger than off-chip bandwidth. Because of this, you're going to want to put a small, fast memory close by, on chip, where we can store data.

Okay. So, what happens in this memory hierarchy? When we go to access data, if the data is in our fast memory, we hit in that small, low-latency memory and come back quickly. If the data is not in the fast memory, we have to go all the way out to DRAM and come back. And memory hierarchies work only if you actually find useful data in that fast memory. If you build a small, fast memory close by but never find any information in it that you need, why did you build the structure? You're just doing extra work and burning extra power.
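One standard way to quantify that trade-off is the average memory access time; the formula and the numbers below are a common rule of thumb I'm adding for illustration, not something stated in the lecture:

$$
\text{AMAT} = t_{\text{hit}} + (\text{miss rate}) \times (\text{miss penalty}).
$$

For example, with a 2-cycle hit time and a 200-cycle miss penalty, a 10% miss rate gives 2 + 0.1 × 200 = 22 cycles per access on average, while a 90% miss rate gives 2 + 0.9 × 200 = 182 cycles, barely better than having no fast memory at all.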
And this is one of the key insights of why we have caches, of what a cache is: when we build a memory hierarchy like this and introduce something like a cache, finding the data nearby is what makes it faster to access, so we need to find the data there with high probability. Otherwise, we should just not build it and go out to main memory every time; you'd just be wasting everyone's time, and area, and cost. So a memory hierarchy only works if your fast memory is actually useful.

Okay. So, let's look at different memory access patterns and try to come up with a couple of different techniques we can exploit to increase the probability that the data we're going to need is in that small, fast memory.

We have an interesting graph here, and I want to explain what it is. We'll look at a toy example first, and then we'll look at a memory reference trace from a real operating system with a real program running under it. So, let's start with some examples. We have time on the horizontal axis and address on the vertical axis. Say your machine has 4 GB of RAM: there is effectively a row on this plot for every single address in the machine, and as the program executes and accesses different pieces of memory, we put a dot on the graph.

Let's start by looking at something like instruction fetch. Typically, programs execute in order: they execute the first instruction, the next instruction, the third instruction, with the addresses monotonically increasing, unless you hit a branch. So let's take a look at some instruction fetches. We start off executing some code, and then we execute a loop. When we're in the loop, we're executing the same piece of code over and over again. If you look at that from an address perspective, we start off outside the loop at some address, then some different address, and then we execute these three addresses, and then those same three addresses again, and again, and again. We keep fetching those three addresses over and over again from our memory system. And at some point, we exit the loop and go execute something else.
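As a concrete, purely illustrative sketch of that fetch pattern (this code is not from the lecture), consider a small loop in C; the handful of instructions that make up the loop body sit at a few consecutive addresses, and the fetch unit requests those same addresses on every iteration:

```c
/* Illustrative sketch of the instruction-fetch pattern described above. */
int sum_array(const int *a, int n)
{
    int sum = 0;                  /* fetched once, before the loop            */
    for (int i = 0; i < n; i++) { /* loop test: same fetch address each trip  */
        sum += a[i];              /* body: the same few instruction addresses */
    }                             /* fetched over and over, until we exit     */
    return sum;                   /* fetched once, after the loop             */
}
```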
Okay. There's a clear pattern there, and in a second we'll think about whether we can exploit it.

Next, let's talk about accessing the stack. Computer programs typically have stacks. This is where you spill and fill registers, and typically also where you pass arguments. Most modern compilers spill and fill around function calls and returns, and argument passing happens around function calls too: you put the arguments that the called function needs onto the stack and then go execute it. Alternatively, you can pass the arguments in registers, but on something like x86, let's say, you're going to put arguments onto the stack, and you also put local variables onto the stack, because it gives you a convenient place to put them.

So, let's look at what happens when you do a subroutine call. Before the call, the stack pointer is at some address. When you do the call, you push your registers onto the stack: you put something here, something here, something here, and the stack pointer logically grows. Then, once you're inside the function you called, you want to access the arguments, and those are usually very close together, right next to each other on the stack, and you may have to access the same argument over and over again. If you don't have enough registers, it's like the stack-based architecture we looked at in the first lecture, where we had to push the same value onto the stack multiple times. It's pretty common to reuse the same local variable or the same argument, just because you don't have enough registers, and so you load that data from your stack over and over again. So it's pretty common for stack accesses to land very close together, probably within the same cache line, or at least very close by. And then, when you're done, you return from the subroutine and pop all of that data off the stack. Time is increasing along the horizontal axis, so the pops walk back over the same addresses in reverse order.
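Here is a minimal, purely illustrative sketch of that in C (not code from the lecture; the function names and constants are made up). The point is that the callee's arguments and locals end up packed together in one small stack frame, those few addresses get reused while the function runs, and the whole frame is popped again on return:

```c
/* Illustrative sketch of stack accesses around a subroutine call. */
static long polynomial(long x, long a, long b, long c)
{
    /* x, a, b, c and tmp live in (or spill to) one small stack frame, so
     * their addresses are all within a few words of each other, and x is
     * reused several times from the same slot. */
    long tmp = a * x * x;
    tmp += b * x;
    tmp += c;
    return tmp;
}

long caller(long x)
{
    /* Each call pushes a frame and each return pops it, so the second call
     * reuses the very same stack addresses as the first. */
    return polynomial(x, 3, 2, 1) + polynomial(x + 1, 3, 2, 1);
}
```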
Okay. Let's look at some data accesses. The one I want to focus on here is vector accesses. What's a vector access? Well, this is something like an array: a lot of the time you'll read the first word, or the first byte, then the next one, and the next one, doing some operation across all of it. Say we're doing an element-wise addition: we have one array and we're adding it to another array, and they're one-dimensional. You're reading different addresses each time, but they follow a pattern and they're close by. Now, if you're accessing, say, every fourth element of the array, this distance gets larger: the address distance grows and your access pattern gets a little farther apart.

And then scalar values: these are singleton values. A good example is, instead of adding two arrays together, we take one array and add some variable x, which holds, let's say, five, to every single element of the array. What your processor would do is first load the element from the array, then load the scalar value x from memory, do the add, and do the store. Then you load the next element of the vector, load x again, do the add, store it back, and continue on like that.
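Written out as a small, purely illustrative C loop (not from the lecture), the pattern looks like this; notice that a[i] moves to a neighboring address on each iteration, while x (or its memory slot, if the compiler doesn't keep it in a register) is the very same location read over and over:

```c
/* Illustrative sketch of the vector-plus-scalar loop just described. */
void add_scalar(int *a, int n, int x)
{
    for (int i = 0; i < n; i++) {
        /* a[i]: a new address each iteration, but right next to the
         * previous one.  x: the exact same value, reused every time. */
        a[i] = a[i] + x;
    }
}
```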
Okay. So, can we try to exploit these different types of locality? This slide is really important for grasping what's going on in caches; it's probably one of the most important slides in today's lecture: understanding the different patterns we're seeing, and how we exploit them. There are two types of patterns I want to talk about, and your standard cache system is going to exploit both of them, but let's first look at them individually.

The first one is temporal locality. Temporal locality is the notion that if you access a piece of data, you're likely to access that same piece of data again in the not-too-distant future, so it might be useful to keep that data nearby. Stated again: if a location is referenced, it is likely to be referenced again within a short period of time. Let's try to identify that in the diagrams so far. Well, I see it in a few places. In the loop's instruction fetches, we access some address, and then we access that same address again pretty soon, and again, and again: we're accessing something with high temporal locality. Likewise, the scalar value: we access the exact same address again and again, and the time between one access and the next is short.

Okay. Now let's look for other interesting access patterns. Interestingly, on the stack, when we're pushing values, it's very common to access a value at, say, address 1000, and then go access 1000 + 4, 1000 + 8, 1000 + 12, 1000 + 16, touching all of the nearby values. What are we going to call that? Is it a common pattern? Well, yes, it's pretty common: we see it here on the stack, and we see it in instruction fetches, where after accessing one instruction it's very common to access the instructions packed right after it, because you usually just fall through in your code. So we're going to define this and call it spatial locality. Spatial locality says that if a location is referenced, it is likely that locations near it will be referenced in the near future.

These are both just heuristics; there's nothing magical about them. They're just characteristics of how programs happen to behave. But we're going to build systems that leverage these two types of locality to improve the performance of our processor, so that those small RAMs we put in our memory hierarchy actually hold useful data.
I also wanted to point out that you can have both spatial and temporal locality mixed together. If we look at something like a program with lots of loops, we have spatial locality, because we're executing instructions that are near each other, but we also have temporal locality, because we're reusing that same code again and again within a short period of time. Even when it's not the exact same piece of data, it's nearby, so we see both spatial and temporal locality exhibited there.

Okay. So, let's finally look at a real memory reference pattern, a real memory reference trace. It has the same axes as before: time on the horizontal, with one dot per access, and memory address on the vertical. I don't know exactly what the units on the vertical axis are; they're probably in the thousands or millions. This is from a study that some people at IBM did many years ago, and we can see the different patterns show up in it.

The first thing I want to circle here is spatial locality. Along this line, we access the first piece of data at some address and then go on to access all of the data after it in sequence, as if we were reading an array: we just read all the values in the array. That's spatial locality: because we accessed element one of the array, there's a high probability we'll access other data values nearby in that array at a nearby time.

Now let's look at temporal locality. Horizontal lines here are temporal locality: you can see the program accessing this value, or at least values very close by (it's hard to tell at this resolution), over and over again. And finally, we can see both temporal and spatial locality in these little blocks, where we access similar sorts of addresses, which is spatial locality, and do it all within a short period of time, so we're also seeing temporal locality of reference. And caches can exploit both of these.
Caches exploit temporal locality by keeping data that gets used and reused many times nearby. When you see accesses that are close together in time, the cache, which sits between the processor and the main memory system, effectively memorizes the value you accessed and keeps it close by. So if you go to access it again soon, and your program exhibits temporal locality, that data value will be right there and you won't have to go out to the big, slow DRAM.

And caches can exploit spatial locality. The way they do it is that when you access a particular piece of data, the cache pulls in the data around that piece of data as well. We'll talk more next time about exactly how that works, but we're going to segment memory into what we call cache blocks, or cache lines, and those cache blocks or cache lines are typically aligned. When you access a piece of data, it might be at the beginning of a line, at the end, or in the middle, but you pull in all the data around it. The heuristic here is that if you access some piece of data, say only the first byte in a cache line, you're likely to access other bytes in that block, multiple pieces of data around it. So we're able to exploit spatial locality here.
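As a small, purely illustrative sketch of that idea (the 64-byte line size below is my own assumption; actual line sizes vary by machine), here is how an address splits into an aligned cache-line base and an offset within the line; a miss brings in the entire line, so later accesses to the neighboring bytes hit:

```c
/* Illustrative sketch: splitting an address into its cache-line base and
 * offset, assuming 64-byte aligned cache lines. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u   /* assumed line size in bytes */

int main(void)
{
    uint64_t addr      = 0x1234;                            /* some byte address */
    uint64_t line_base = addr & ~(uint64_t)(LINE_SIZE - 1);
    uint64_t offset    = addr &  (uint64_t)(LINE_SIZE - 1);

    /* A miss on addr fills bytes [line_base, line_base + LINE_SIZE - 1], so
     * subsequent accesses to any of those neighbors hit in the cache. */
    printf("address 0x%llx -> line base 0x%llx, offset %llu\n",
           (unsigned long long)addr,
           (unsigned long long)line_base,
           (unsigned long long)offset);
    return 0;
}
```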