Okay, now we get to move on to the meat of today: we're going to talk about non-blocking caches, also known as out-of-order memory systems, also known as lockup-free caches. I think the first paper published on this called them lockup-free caches; most people call them non-blocking caches today. And if you think about it from a memory perspective, it's an out-of-order memory system.

What does a non-blocking cache allow you to do? It enables subsequent memory operations from the main processor pipeline to keep occurring even when an earlier instruction in your sequence has taken a miss. In all the pipelines we've looked at up to this point, even our out-of-order pipelines, we basically said that on a cache miss you just stop the pipe, because we couldn't deal with memory coming back out of order, even in our out-of-order processor pipelines; we didn't have enough bits to track all that. Now we're going to talk about structures that let us track out-of-order memory.

There are two major things this allows you to do: it allows you to have a cache hit under a miss, and it allows you to have a miss under a miss. What do I mean by miss under miss? You do an access, say a load, to some address, and it takes a cache miss. You keep executing your program, you do another load, and that one takes a cache miss too, and you are able to process both of them if you have a non-blocking cache, because it allows a miss under a miss.

One of the big points I want to get across today is that this is not only for out-of-order processors. We've talked a lot about out-of-order processors, but believe it or not, you can hook a non-blocking cache up to an in-order processor, or even an in-order VLIW processor. How do you go about doing that? One major way is that when you take the cache miss, you mark the register as not being there, so that when you go to actually read the data, you block. We'll show an example of that a few slides from now. But really what I'm trying to get across is that you can have in-order processors with out-of-order memory systems, and you can have out-of-order processors without out-of-order memory systems. Both are possible.
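To make miss under miss concrete before going on, here is a tiny, hypothetical C fragment (the array names and sizes are made up for illustration). The two loads are independent, so a miss-under-miss cache can have both fill requests outstanding at once and the core only stalls when the values are actually used.

```c
/* Hypothetical illustration of miss under miss: the two loads are
 * independent, so a non-blocking cache can send the fill request for
 * b[j] before the fill for a[i] has returned. */
#include <stdio.h>

#define N (1 << 20)

static int a[N];
static int b[N];

int sum_two(int i, int j)
{
    int x = a[i];   /* load 1: may miss; request goes out to memory     */
    int y = b[j];   /* load 2: independent, may also miss; with miss
                     * under miss a second request can be outstanding   */
    return x + y;   /* first real use of the data: this is the stall point */
}

int main(void)
{
    printf("%d\n", sum_two(3, 70000));
    return 0;
}
```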
Okay. So, a couple of big challenges here.

First, if you have multiple out-of-order misses, the memory system is going to return data out of order, and you have to deal with that somehow. The requests might end up in different memory banks, and the data can come back in a different order than you sent the requests out in. That gets hard to deal with: you send out cache misses for x, y, and z, and they come back in z, y, x order, or something like that, and you need to make sure you're delivering the right data to the right location in the cache, and the right data to the right instruction. So we're going to need some big associative table to figure this out.

The second major challenge is that lots of times you're going to have a later load or store to an address on the same cache line as a miss that's still outstanding. What do I mean by that? Well, it's pretty common that you'll be doing loads sequentially through memory, and if the first load on a cache line misses, the second load is also going to miss, but they're on the same cache line. You don't want to send two of those out to the memory system at the same time, because you might confuse the memory system. Worse, say you have a load to one address and a store to another address, but they're both on the same cache line. The load might go out to memory, then you do the store and the store goes out to memory and updates main memory somehow, but you bring in the original load data, and the two sort of pass each other in transit: the load data passes the store data, and all of a sudden you have the wrong data in your cache. It never got updated, and the reason that happens is the store didn't have anywhere to merge into; it couldn't actually deposit its data into the cache, for instance.

Okay, so how do we go about handling this? Before we do that, let's look at a timeline. Time goes from left to right on this graph, and at the top we have a blocking cache.
With a blocking cache, you're happily running the CPU, you do a load, we'll say, or a store, it doesn't matter in these systems, and you take a cache miss. In the most basic blocking cache you wait for the cache line to get filled in, then you return the data, and then you keep running the CPU; there's no overlap happening here at all. But we want to go faster.

If we have a non-blocking cache, we can take a hit under a miss. What that means is, we're running along, we take a cache miss, and the request goes out to main memory, but we don't stop the CPU from executing. Let's say it's a load to register five. Well, as long as no one looks at register five, does the processor care that the load to register five took a cache miss? Probably not. Now, there might be some complexities if that load to register five takes a trap or an interrupt; then it might care, but let's say it doesn't. One of the reasons this is usually safe to do is that you've already done all the memory checks and you're effectively past the commit point, because you're pretty late in the processor pipe by the time you know you have a cache miss. So it's not too harmful. Then you keep executing, you access the cache again with a different load, you get a hit, that's great, you just keep executing. It's only when you go to use the missing data that you stall. Effectively, we've overlapped computation with our miss penalty, and that's pretty nice.

Something else we can do is miss under miss. With miss under miss, we're executing along, we take a cache miss, and we send a request out to the main memory system to go get the data, but we don't stop executing. We keep executing, we take another miss, and we send that one out to the main memory system as well. At some point maybe we actually go look at the data, or maybe we don't look at it until much later, and the CPU just keeps executing the whole time. It overlaps multiple memory accesses with computation, so this can be really powerful. And you can do this, as I said, with an in-order processor.
One thing I do want to say is that usually you have a limited number of outstanding memory accesses, some small integer, maybe four or eight. There are diminishing returns as you add more and more, so usually this isn't thousands of outstanding memory accesses.

So let's look at the structure here. There are a couple of different names for this thing, depending on which school of computer architecture design you come from. If you're from the Alpha, or Digital Equipment Corporation, design philosophy, you're going to call this thing a miss address file. If you come from the Intel school of things, you're going to call it a miss status handling register. The miss status handling register actually predates Intel; it goes back to, I believe, Control Data Corporation. There's a paper from Control Data on miss status handling registers from back around the late '60s or early '70s.

Let's look inside this thing and understand what's going on. You have some small number of miss status handling registers, probably in a register file. Each one has a valid bit, and it has a block address. Now, this block address is not the address of the load or the store; it's the address of the cache line that the load or store falls in. When we use this structure, we're going to check subsequent memory accesses, or subsequent misses, against previous ones that are still going on, and we may have multiple of those in flight. We don't actually need the address of the load or the store; we need the address of the entire line, because we need to check against the entire line. That's what's in play. We also have a bit that says whether the miss has been issued or not. Why do we have that? Just because you took a miss doesn't mean the request is actually out in the memory system, or even on its way to the memory system yet. So you fill the entry in, and it sits there until you have some time to go talk to main memory, and hopefully that happens quickly. In some architectures you may not even need this bit; you might just stall until the request actually goes out to main memory. But once it's issued, you set that bit. What this allows you to do is have multiple misses which are not yet issued.
So if you have a miss under a miss under a miss, and it all happens really quickly, you can fill up this table quickly without having to wait for the requests to stream out to main memory yet. That's half of it.

Then we have a bunch of load/store entries. These load/store entries also have a valid bit, and they have a pointer back to one of the miss status handling register entries: a number that says which entry in the miss status handling register file this belongs to. These entries are for the individual loads or stores that are occurring. What this allows to happen is that if you have a load miss to address zero and a load miss to, I don't know, address ten, and they're both on the same cache line, you can fill in this table with two entries and they'll both point back to the same miss status handling register location; they'll just have different offsets. And in the destination field you fill in which register on the processor each one is destined for. So we're going to use these tables such that when a load comes back, or excuse me, when a cache line comes back from main memory, we can check these two tables and do two things: one, fill the line into the cache somewhere, and two, return the data to the correct destinations, finding which piece of data each one needs from its offset within the cache line. That's a mouthful to say; this is a somewhat complicated data structure to work out.
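As a rough illustration of the two tables just described, here is a minimal sketch in C. The field names, widths, and table sizes are assumptions for illustration, not any particular machine's layout; in particular, using the destination field as a store-buffer index for stores is one possible choice, discussed below.

```c
/* Sketch of a miss status handling register (MSHR) file plus the
 * load/store entry table. Sizes and field names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR      8    /* outstanding cache-line misses            */
#define NUM_LDST_ENT 16    /* individual loads/stores waiting on them  */

/* One entry per outstanding cache-line miss. */
typedef struct {
    bool     valid;        /* entry in use                             */
    bool     issued;       /* request actually sent to main memory yet */
    uint64_t block_addr;   /* address of the whole cache line, not of
                            * the individual load or store             */
} mshr_t;

/* One entry per individual load or store waiting on a miss. */
typedef enum { OP_LOAD, OP_STORE } op_t;
typedef enum { SZ_BYTE, SZ_HALF, SZ_WORD } acc_size_t;

typedef struct {
    bool       valid;      /* entry in use                              */
    uint8_t    mshr_idx;   /* which MSHR entry (which line) it waits on */
    uint8_t    offset;     /* byte offset of the access within the line */
    op_t       op;         /* load or store                             */
    acc_size_t size;       /* byte / half-word / word                   */
    uint8_t    dest;       /* loads: (physical) destination register;
                            * stores: index into a store-data buffer    */
} ldst_entry_t;

/* The two tables themselves. */
static mshr_t       mshr_file[NUM_MSHR];
static ldst_entry_t ldst_table[NUM_LDST_ENT];
```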
Now I want to point out where the associative matches versus the indexed matches happen in these tables. One thing you might notice is that we have to check every subsequent miss against this block address: we take the higher-order bits of the address and check them against it, and if they match we don't add a new miss status handling register entry; instead we just merge the new miss into the currently pending one. We do still have to add a new load/store entry for it, which points back over here. Then, when a memory transaction comes back from main memory, we look in this table and say, here's the one I had issued, and I need to clear it out of the table. Given the index of the entry I'm clearing, I associatively check that index against all of the load/store entries and wake up the ones that match. So if the returning line belongs to entry number one, and there are multiple load/store entries pointing at entry number one, I need to return data to the registers for all of them. And when I return data to a register, I have to mark that register as being available again. The type field here just says whether it's a word, half-word, or byte, and whether it's a load versus a store.

One other thing I wanted to say is about the destination field. For loads it's going to be a register identifier. Now, this might be a physical register identifier rather than an architectural register identifier if you have a register renamer; if you have an out-of-order processor this can get more complicated, but it's typically a physical register identifier, not an architectural one. For stores, you also effectively need to track something in this table, because you need to merge the store with the background data. So a store miss also adds an entry here; it sends a request to main memory to go get the background data, and when that comes back, it gets deposited into our cache. For a store, the destination needs to point at some other buffer; this is very similar to the store buffer we had before, and you can have an array of those, for instance. It tells you which store to go play against the cache, so you can merge the returned data with the store data that we've buffered locally.

So this is kind of fun. We can have memory coming back out of order, and we can have memory being issued out of order. Lots of fun toys here. And one of the fun things we can do, because of the associative check, is make sure that subsequent requests to a line that's already in flight simply merge into the pending miss, so we don't generate more memory traffic, and we don't cause strange situations where responses are coming back while new requests for the same line are going out and data gets lost in flight.
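Continuing the hypothetical sketch above (same made-up types and table sizes), the miss path might look roughly like this: associatively compare the line address of the new miss against the valid MSHR entries, merge into a pending entry if one matches, otherwise allocate a new MSHR entry, and in either case allocate a load/store entry; if either table is full, stall.

```c
/* Miss path, continuing the sketch above. Returns false to mean
 * "stall the processor until an entry frees up". */
#define LINE_BYTES 64

static bool handle_miss(uint64_t addr, op_t op, acc_size_t size, uint8_t dest)
{
    uint64_t line = addr / LINE_BYTES;           /* line (block) address */
    uint8_t  off  = (uint8_t)(addr % LINE_BYTES);
    int e = -1, m = -1;

    /* We need a free load/store entry no matter what. */
    for (int i = 0; i < NUM_LDST_ENT && e < 0; i++)
        if (!ldst_table[i].valid) e = i;
    if (e < 0) return false;                     /* stall: entries full  */

    /* Associative check against every pending line miss. */
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr_file[i].valid && mshr_file[i].block_addr == line) m = i;

    if (m < 0) {                                 /* no match: new MSHR   */
        for (int i = 0; i < NUM_MSHR && m < 0; i++)
            if (!mshr_file[i].valid) m = i;
        if (m < 0) return false;                 /* stall: MSHRs full    */
        mshr_file[m] = (mshr_t){ .valid = true, .issued = false,
                                 .block_addr = line };
    }

    /* Point the new load/store entry at the (possibly shared) MSHR. */
    ldst_table[e] = (ldst_entry_t){ true, (uint8_t)m, off, op, size, dest };
    return true;
}
```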
I should point out that this is only one implementation. Lots of times, what people do is put an actual tag field in the miss status handling register, because main memory won't necessarily carry the entire address along with the response it sends back; instead it just carries a tag. So instead of checking against the block address, you might check against a smaller tag. That's one way people make this a little easier. Another thing most people do is make the tag, if you have a small number of these entries, simply the index of the entry in this table, so you don't even need an associative check on the response. That's an optimization; you still need the associative check when future loads and stores that miss come to check this table.

I think I've walked through most of this already. On a cache miss, you check the table for a matching address. If one is found, you allocate a new load/store entry that points to the existing miss status handling register entry. If not, you have to allocate both a load/store entry and a miss status handling register entry. One thing I did want to point out is that if you run out of miss status handling registers or load/store entries, that's not the end of the world; you can just stop the processor. Say you have eight of them and you've used all eight, so you have eight outstanding memory transactions: you can just stall for a while. At some point one of those main memory accesses is going to come back.

When data returns, you need to find the loads and stores that are waiting for it. Going back to what Berchun said, it's very possible that the load or store that was waiting on it has actually disappeared in that time period, at least for a load, because a write-after-write to the destination register might have occurred. That's okay; you still want to fill the line into your cache at that point. And of course you can have multiple loads and stores waiting. When the cache line is completely returned and you've finished checking it against all the load/store entries, you deallocate both the load/store entries and the miss status handling register entry.
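And the return path, again continuing the same hypothetical sketch: when a line comes back, find the MSHR entry it belongs to, merge any buffered store data, write the line into the cache, then match that MSHR's index against every load/store entry, delivering data to each waiting load's destination register, and finally deallocate everything. The cache_fill_line, writeback_register, and merge_store_data helpers are stand-ins for whatever the rest of the machine provides.

```c
/* Return path, continuing the sketch above: the line for MSHR entry m
 * has come back from memory. The three helpers are hypothetical hooks
 * into the rest of the machine. */
void cache_fill_line(uint64_t block_addr, const uint8_t *data);
void writeback_register(uint8_t phys_reg, const uint8_t *data, acc_size_t size);
void merge_store_data(uint8_t store_buf_idx, uint8_t *line, uint8_t offset);

static void handle_fill(int m, uint8_t line_data[LINE_BYTES])
{
    /* Stores waiting on this line merge their data into it first... */
    for (int i = 0; i < NUM_LDST_ENT; i++)
        if (ldst_table[i].valid && ldst_table[i].mshr_idx == m &&
            ldst_table[i].op == OP_STORE)
            merge_store_data(ldst_table[i].dest, line_data,
                             ldst_table[i].offset);

    /* ...then the merged line is deposited into the cache...         */
    cache_fill_line(mshr_file[m].block_addr, line_data);

    /* ...and each waiting load gets its piece of the line, selected
     * by its offset, delivered to its destination register.          */
    for (int i = 0; i < NUM_LDST_ENT; i++) {
        if (ldst_table[i].valid && ldst_table[i].mshr_idx == m) {
            if (ldst_table[i].op == OP_LOAD)
                writeback_register(ldst_table[i].dest,
                                   &line_data[ldst_table[i].offset],
                                   ldst_table[i].size);
            ldst_table[i].valid = false;         /* deallocate entry   */
        }
    }
    mshr_file[m].valid = false;                  /* deallocate the MSHR */
}
```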
Okay, so now a little bit of fun with in-order machines. You can sort of see how this fits relatively logically into an out-of-order pipeline. If you want to fit it into an in-order pipeline, it's not too hard: you can add a scoreboard bit for each individual register. And when I say scoreboard, you're not tracking where the data is coming from; instead there's a special bit saying this register is out to lunch, this register is out in the memory system, and if you try to access it, you just wait, you just stall. Because a miss is a variable-latency sort of thing, your scoreboard can't say the data will be ready in some fixed number of cycles; you just don't know, it's out in the memory system. So on a load miss, you mark the destination register as busy; when the data comes back, you mark it available and unstall the processor. But if no one actually went to use that register in the meantime, while it was out in main memory, the processor never stalls and no one is ever the wiser. There's a small sketch of this idea at the end of the section.

Okay, we're almost out of time, so I want to wrap up here for a second. Non-blocking caches can effectively increase the bandwidth to your lower levels of cache, your L1s: one thing they do is increase bandwidth because they can merge misses to your cache. Now, you probably would have gotten much of that anyway with a blocking cache, since the second access would hit once the line had been filled, but the miss status handling registers let multiple cache misses merge into a single transaction. And your miss penalty is obviously lower because, going back to the earlier picture, we've overlapped the miss penalty with other useful work.
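To close, here is a minimal sketch of the in-order scoreboard idea mentioned above: one busy bit per register, set when a load misses, cleared when the fill writes the register back, and checked (stalling if set) only when an instruction actually reads that register. Everything here, names included, is a hypothetical illustration rather than any particular pipeline.

```c
/* Minimal scoreboard sketch for hooking a non-blocking cache to an
 * in-order pipeline: one busy bit per architectural register. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32

static bool reg_busy[NUM_REGS];   /* "out to the memory system" bits */

static void on_load_miss(uint8_t dest_reg)
{
    reg_busy[dest_reg] = true;    /* data is out to lunch; keep executing */
}

static void on_fill_writeback(uint8_t dest_reg)
{
    reg_busy[dest_reg] = false;   /* data arrived; register usable again  */
}

/* Called in decode/issue for each source register of an instruction:
 * returns true if the pipeline must stall this cycle. */
static bool must_stall(uint8_t src_reg)
{
    return reg_busy[src_reg];
}
```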