Now we get into the meat of today's lecture. We're going to talk about all these different advanced cache optimizations, starting with pipelining the cache.

Okay. So here is our sketch of what happens in the cache. In the cache, we have a tag array and we have the data blocks here. And we want to do a write into the cache. So forget about the read for a second; instead, let's focus on the write. In the write, you're going to do a tag check. You index into your tag array, it spits out the tag, you check the valid bit also, and you do a comparison here. If it matches, that means the data is already in the cache, and since you're doing a write, you can just write the data into the cache. So that sounds good.

One challenge here is that this really is sequential. You absolutely need to do the tag check first, before you do the write. Because if a different address is stored in this location and you blindly do the write, you are going to overwrite the wrong data. And you're going to be very unhappy, because you overwrote data that didn't belong to this particular write; it was at the wrong address.
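To make that ordering concrete, here is a minimal C sketch of the tag-check-then-write sequence for a direct-mapped cache. All the names and sizes (cache_line_t, NUM_SETS, BLOCK_SIZE) are made up for illustration; the real design is combinational hardware, not software.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   256
#define BLOCK_SIZE 64

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a write hit. Note the strictly sequential order:
 * the tag must be checked BEFORE the data array is written, or a
 * miss would silently overwrite some other address's block. */
bool cache_write(uint32_t addr, uint8_t byte)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    uint32_t offset = addr % BLOCK_SIZE;

    cache_line_t *line = &cache[index];
    if (!line->valid || line->tag != tag)
        return false;               /* write miss: do NOT touch the data */

    line->data[offset] = byte;      /* safe only after the tag check */
    return true;
}
```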
So that's definitely a bummer. Do we think we want to do this in one cycle? Anyone want to wager a guess here? Do people build machines where they do this in one cycle?

Well, sure. Original machines did do this in one cycle. It's possible to build: you have one long combinational path that comes from the index, through the tag array, through all this comparison logic, then through the write enable, and then into the data array. It's buildable, but it's not great for your clock period.

So, the first optimization: how do we reduce the hit time for a write? There are a couple of different strategies for this. We can think of it as doing the write over two different cycles: the first cycle checks the tag, we'll say, and the second cycle does something to the data. That's our problem statement. What are our solutions?

One solution, which is kind of innovative, is to build a multi-ported cache: a cache which can simultaneously do a read and a write to the same address. So you have a cache and you're doing a write. You blindly do the write without checking the tag, but at the same time, you read out the data that was there and you save it off. And if you took a tag miss on the write, you go back and actually fix it up by filling the old data back in. That is an option; people have built such things. You just restore the old value on a tag miss. It works perfectly fine, but you need a side buffer. This kind of looks like a victim cache, which we'll talk about later in today's lecture. You can do a write and a read at the same time, and it's like you're putting the new data in speculatively while pulling out the old data, the victim. If the write hits in the tag check, then you can just let it go and everything is great. If not, you have to basically pull that old data back out and undo the write. It's doable. Not a lot of people do this, but people have built machines like that.
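As a rough illustration of that write-blindly-and-restore idea, here is a sketch continuing the C model above. The victim_t side buffer and the two function names are hypothetical; a real design would do the write, the victim read, and the tag check in overlapping hardware cycles.

```c
/* Side buffer holding the victim of a speculative write, so it can
 * be restored if the tag check turns out to be a miss. */
typedef struct {
    bool     pending;   /* a speculative write is outstanding */
    uint32_t index, offset;
    uint8_t  old_byte;  /* old data saved for a possible undo */
} victim_t;

static victim_t side_buffer;

/* Cycle 1: write the new data and read out the old value at once,
 * as a dual-ported data array would. No tag check yet. */
void speculative_write(uint32_t addr, uint8_t byte)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t offset = addr % BLOCK_SIZE;

    side_buffer = (victim_t){ true, index, offset,
                              cache[index].data[offset] };
    cache[index].data[offset] = byte;
}

/* Later: the tag check resolves. On a miss, restore the victim. */
void resolve_tag_check(bool hit)
{
    if (!hit && side_buffer.pending)
        cache[side_buffer.index].data[side_buffer.offset] =
            side_buffer.old_byte;
    side_buffer.pending = false;
}
```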
Another solution is to build a fully associative cache. Now, what do I mean by fully associative cache? Well, I mean a cache which has lines such that you can store to any location in the cache. In those structures, the tag check and the data access are typically done in parallel. Because you don't know where to find the data, there is no index operation; the index operation is the tag check, or the CAM operation, the content addressable memory operation. You're basically doing the tag check in order to do the access anyway, so it doesn't really hurt you to do the tag check first and then do the write. It's not necessarily a great design if you have a very large cache, but for a small cache you can definitely have a content addressable memory for the tags, and then just have it so you can write anywhere, since the cache is fully associative. That is one way to solve this write problem.

And then there's what we're going to focus on in today's lecture: pipelining the write. To pipeline the write, instead of actually doing the write in the M stage, the memory stage, of your pipeline, we're going to do the tag check in the M stage and then hold the store data for some time in the future. Now, you're going to say, did we actually do the store then? Well, yes. We're going to call that committed state. So this is what we're going to focus on today: how to do a pipelined memory access to reduce our write hit time.

So here we have our five-stage pipeline. What we're going to do is, in the M stage for a write, we're going to check the tag, and we're not going to fire up the data array at all. Now, you might say, where do you put the data? Well, we put a nice little buffer here which basically holds the data to be stored. It stores the data that's going to be stored in the future; we'll call it a delayed write data buffer. So in parallel, you check the tag and you store the data here. And then, sometime in the future, you want to move this into the cache. But you need a convenient time to do that. One option is to wait for a dead cycle on the cache.

Okay, that sounds good. But how do you know you're going to get a dead cycle? I can't guarantee that. I can write a piece of code which does store after store after store in a really tight loop, so you'll never have a dead cycle. But a cool trick here is that a subsequent store only has to use the tag array. So if you have a store after a store, the first store will check the tag and not use the data array; we'll stash the data here. And then, when the second store comes down the pipe, it'll check the tag, but the data array is open, so you can do the first store's data write at that time. So there's a cool trick there. If you have a store after a store after a store, you can basically decouple the tag from the data for stores and use the port on the data array of the cache to do the store later.

We'll go through this with a little more detailed drawing here; this is just to reiterate what was going on. You do the store, it checks the tags, and it stores the address and the data. And at some point in the future, when there's an idle cycle or another store going on, it actually transitions this data into the data array.
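Here is a sketch of those delayed write buffer mechanics, continuing the same hypothetical C model. The point to notice is how a subsequent store's tag-check cycle is used to drain the previous store's data into the data array; the function name and the single-entry structure are illustrative, and hit/miss handling is elided.

```c
/* Single-entry delayed write buffer: a store checks the tag in the
 * M stage and parks its address and data here; the data array is
 * written later, when it would otherwise be idle. */
typedef struct {
    bool     valid;
    uint32_t index, offset;
    uint8_t  data;
} delayed_write_t;

static delayed_write_t dwb;

void store_m_stage(uint32_t addr, uint8_t byte)
{
    /* 1. This store only needs the tag array this cycle, so the
     *    data array is free: drain any previously buffered store. */
    if (dwb.valid)
        cache[dwb.index].data[dwb.offset] = dwb.data;

    /* 2. Tag check for the current store happens here (elided;
     *    assume it hits). */

    /* 3. Park the current store's data for a future cycle. */
    dwb = (delayed_write_t){ true,
                             (addr / BLOCK_SIZE) % NUM_SETS,
                             addr % BLOCK_SIZE, byte };
}
```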
That sounds good. Okay, so, pop quiz question here. What happens when you do a load and there's data in the delayed write data buffer?

You have to bypass it, yes; you need to go check it. So you need to have something which checks the delayed write address against a future load, and that's actually drawn here. And if you get a hit there, you need to have this data come around and come out. So we basically need to go check this buffer.

One thing I do want to point out: this is kind of the naive way to do this. More advanced processors will typically have a multi-entry version of these delayed write data buffers, because caches are usually pipelined inside, too, and it can take multiple cycles to actually access the data array, so you might not do the data write until the end of the pipe. On some of the processors I've worked on, these delayed write buffers are on the order of maybe two to four entries big. So you can generalize this into a bigger data structure, and when you do loads, you obviously have to do a content addressable match against all of those entries to see if the data is in the delayed write data buffer.
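And here is a sketch of that load-side bypass against a multi-entry delayed write buffer, continuing the model. The four-entry size echoes the "two to four entries" mentioned above; newest-entry priority and full tag matching are simplified away, so treat this as the shape of the check, not a complete design.

```c
#define DWB_ENTRIES 4
static delayed_write_t dwb_multi[DWB_ENTRIES];

/* Load path: every buffer entry is compared against the load
 * address in parallel, like a small CAM. A buffered store to this
 * address must win over the (stale) data array contents. */
bool load_byte(uint32_t addr, uint8_t *out)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t offset = addr % BLOCK_SIZE;

    for (int i = 0; i < DWB_ENTRIES; i++)
        if (dwb_multi[i].valid &&
            dwb_multi[i].index == index &&
            dwb_multi[i].offset == offset) {
            *out = dwb_multi[i].data;   /* bypass: forward buffered data */
            return true;
        }

    *out = cache[index].data[offset];   /* normal path (tag check elided) */
    return true;
}
```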
Okay. So let's go see how well this does, pipelining our cache. We're going to use this framework throughout today's lecture: the ways an optimization makes life better. You can reduce the miss rate, so do something good for the miss rate. You can do something good for reducing the miss penalty. You can reduce the hit time; this would be, for instance, having a smaller cache, which reduces the hit time. Or you can increase the bandwidth to your cache. All of these things factor into performance, but not all of the techniques and optimizations we're going to talk about today touch all of them. In fact, most of them only touch one or two at a time.

Okay. So, pipelined caches: what do they do for us? Are they going to reduce the miss rate? I see people shaking their heads no. Yeah, we didn't make anything bigger or smaller here, so it's probably not going to touch the miss rate. Is it going to affect our miss penalty? This is, when we miss in the cache, how long it takes us to go to, let's say, main memory or the next level of cache. Hmm, no. It's not actually going anywhere different; effectively, this is implementing the exact same thing.

Hit time: does this affect the hit time? It's kind of hard to say whether the effect here is in a positive or a negative direction. If you compare it to having the tag access and the data access in the same cycle, pipelining is actually going to make the hit time a little bit better, because you're doing less in a cycle. So that would have us put a plus here. But if you compare it to, let's say, not having a cache, or just having a big RAM there, it actually makes the hit time worse, because we need to mux in extra stuff here: we need to do the associative check against our delayed write buffer data. So that hurts us a little bit there. Okay, so that might be a plus or a minus.

The bandwidth, though, is actually getting better with this cache. Before, we had to wait for this cache for, let's say, the whole time we go through the tag check, the data check, and the data write. But now we can actually have two stores happening at the same time: the tag check for one store and the data write for a different store. So this is really going to make the bandwidth better. And that's where we are with pipelined caches.
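To recap the scorecard for pipelined cache writes, as discussed above:

- Miss rate: unchanged (nothing got bigger or smaller)
- Miss penalty: unchanged (misses take the exact same path to the next level)
- Hit time: plus or minus (better than single-cycle tag-plus-data access; worse than a raw RAM, because of the extra bypass check)
- Bandwidth: better (the tag array and data array can service different stores in the same cycle)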