So that is our first optimization technique. If there are no questions, we will move on to our second optimization technique. Your book lists ten optimization techniques. We are only going to cover, I think, seven of them between today and the next lecture, because I think some of them are not actually that important, but we are also going to cover some other ones which I think are important.

Okay, so the next optimization we are going to look at is how to deal with read misses in the cache when the line being replaced holds data that needs to get kicked out of the cache. So here we have our CPU, our L1 data cache, and our next level out here, which we will say is the L2 cache, or maybe main memory. There is something in this cache at a particular line, and it is dirty in the cache, so the dirty bit is set. It has state we cannot throw out. We do a read that aliases to that same location, and we need to evict that line. It will create a victim.

In a naive implementation, we would actually have to sit there and wait for all that data to go out to main memory before we go do the read, get the data, and fill it in. Okay, that is not very good. We have to wait for this evicted dirty line.

We will talk about that in a second. So the processor could be just stalled, waiting on writes. One thing you can do is have read misses go around the writes, sort of pass the writes going out to the unified L2. But one of the problems here is you do not have that many ports onto this unified L2; we will say you only have one port out here. So if you have the load pass the write, you do this little dance where, when the read data, or load data, comes back, you need to have a place to put it. So at that point you might need to wait for the victim to go out to main memory, or out to the next level of cache. So you cannot really get around this. You can say, oh, I will try to get my load out early and just sort of worry about it later, but that does not work if that load hits in your next level of cache: you need to access the next level of cache, and you are still waiting for this data to go out, because you need someplace to put the incoming line.
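To make that concrete, here is a minimal sketch of the naive handling just described, with made-up latencies; the cycle counts and function name are illustrative assumptions, not numbers from the lecture. The fetch is serialized behind the victim writeback because there is only one L2 port.

```python
# Naive miss handling: with one port to the L2, the fetch of the
# missing line must wait until the dirty victim has drained.
# Both latencies below are hypothetical, for illustration only.

WRITEBACK_CYCLES = 20   # assumed cost to write the victim out
FETCH_CYCLES = 20       # assumed cost to fetch the missing line

def naive_miss_penalty(victim_dirty: bool) -> int:
    """Writeback and fetch serialize over the single L2 port."""
    penalty = FETCH_CYCLES
    if victim_dirty:
        penalty += WRITEBACK_CYCLES   # stall until the victim is out
    return penalty

print(naive_miss_penalty(victim_dirty=True))   # 40 cycles
print(naive_miss_penalty(victim_dirty=False))  # 20 cycles
```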
So the solution to this is we put a little buffer between our L1 and our L2, or our L1 and main memory, and this will hold writes, or victims, that go out from the L1 to the L2. Now we have someplace to put the data. So, if we wanted to do this fast, we would do the load, we would miss in our L1 cache, and we would send that request out to the L2. And then, immediately, we would start evicting the line into this buffer. The reason we cannot evict it straight out is that the load is actually using that path right now, or it is using the L2 cache, we will say. But we have someplace to put the victim data, and when the load comes back, we can put it into the L1 data array.

So this brings up a whole bunch of problems, the biggest one being that at some point you actually have to transition the data from the write buffer into the L2 cache. You can do that whenever you have extra time; you could just have some circuits here that check for an idle L2 port. But then comes the question: what happens if you need to do this a second time? Let us say a second load has to create a victim and this buffer is full. What do you do? Well, you could just stall and wait. That is a pretty good option. What you are saying there is that the probability of two victims being generated in a very short period of time is low, and this is actually a scheme that people do use: they do not do anything special, the first victim that gets generated goes into the write buffer, and the second victim that gets generated just stalls the pipe. That is okay.

You get higher performance if you can have the subsequent read basically go past the write buffer here and start actually doing something in the memory system. And if you want to do this, then just like what we did in the previous example, you are going to have to check this write buffer to see if the data is there. And that introduces complexity, because now your data can be in the L1, in the write buffer, or further out. So there are just more places to check.
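Here is a behavioral sketch of that victim write buffer. Everything about it is an assumption made for the example: the single-entry capacity, the class and method names, and the dictionary standing in for the L2 are illustrative, not a real design from the lecture.

```python
# Single-entry victim (write) buffer between the L1 and the L2.
# Evictions park here so the read miss can use the L2 port at once;
# the buffer drains to the L2 in the background when the port is idle.

class VictimBuffer:
    def __init__(self):
        self.entry = None            # (line_addr, data) or None

    def evict(self, line_addr, data):
        """Called when an L1 miss displaces a dirty line."""
        if self.entry is not None:
            return False             # buffer full: the pipe must stall
        self.entry = (line_addr, data)
        return True

    def drain(self, l2):
        """Push the buffered victim into the L2 when its port is idle."""
        if self.entry is not None:
            line_addr, data = self.entry
            l2[line_addr] = data
            self.entry = None

    def lookup(self, line_addr):
        """Later reads must check here: the freshest copy of the victim
        may still be sitting in this buffer rather than in the L2."""
        if self.entry is not None and self.entry[0] == line_addr:
            return self.entry[1]
        return None
```

The `lookup` path is exactly the complexity mentioned above: once the buffer exists, a miss has one more place to search before going out to the memory system.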
Okay, so that is the first half of write buffers. The second half of the write buffer, the other reason we want to put in a write buffer, is if we have a write-through cache. So far we have been talking about write-back caches, which introduce victims. But if you recall, with a write-through cache, every single store that happens deposits its data into the L1 data cache, and it also gets written into the next level of cache, because it is writing through. So let us say you have a write-through from the L1 to the L2. One of the challenges with this is that you might not have enough bandwidth into the L2 cache to basically take in every single store that occurs.

So the solution to this is that you can put a write buffer here which will sort of buffer off some of this extra store bandwidth, and we will introduce the notion of a coalescing write buffer. This is an extra addition to a write buffer that will actually merge multiple stores to the same line. So let us say you have a store to address five and a store to address six with a write-through cache. You do not want to have to write two full cache lines out to the L2. Instead, what people do is have coalescing write buffers. So the write buffer here might have multiple entries, each of which holds a whole cache line. The first store will push its cache line out into here. The second store will try to push its cache line out, but the buffer will notice that it is for the same address it already has, so it will merge the two stores into one entry. What this does is decrease the bandwidth you need at the L2, because it is very common in code to write sequential addresses. So it is common, let us say you are adding two arrays into a destination array, that you will just be writing address after address after address. And you do not want to have to fire up the L2 for every single store you do in that array operation, if you have a write-through cache. So you can put a coalescing write buffer here to save bandwidth into your L2.
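A sketch of that coalescing behavior, under stated assumptions: the 64-byte line size, the four-entry capacity, and all of the names here are invented for illustration.

```python
# Coalescing write buffer for a write-through L1: stores to the same
# cache line merge into one entry, so a run of sequential stores costs
# one L2 write instead of one write per store.

LINE_BYTES = 64   # assumed line size

class CoalescingWriteBuffer:
    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}             # line_addr -> {offset: byte}

    def store(self, addr, value):
        line_addr = addr - (addr % LINE_BYTES)
        offset = addr % LINE_BYTES
        if line_addr in self.entries:             # same line: coalesce
            self.entries[line_addr][offset] = value
            return True
        if len(self.entries) < self.num_entries:  # allocate a new entry
            self.entries[line_addr] = {offset: value}
            return True
        return False                              # full: stall or drain

wb = CoalescingWriteBuffer()
wb.store(5, 0xAA)              # store to address five allocates an entry
wb.store(6, 0xBB)              # store to address six merges into it
assert len(wb.entries) == 1    # one L2 write instead of two
```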
This, then, is our second technique: having this write buffer. Okay, so what does the write buffer do? Does it decrease our miss rate? The cache is the same size, the associativity is the same, so it is probably not going to change our miss rate. Miss penalty? Okay, raise your hand: who thinks this affects the miss penalty? Some people are raising their hands. I think we should all be raising our hands, because that was really what we were trying to do with this whole write buffer: reduce the miss penalty. And the reason this reduces the miss penalty is that a read miss does not need to wait for the write of the victim data to occur; instead, that can just happen in the background.

This also does not affect our hit time, and it may actually help our bandwidth. The reason it does not affect our hit time is that our L1 cache will still work the same as it worked before: you can still do loads and stores against it, and if you hit, it is fine. So it only affects the miss side of things. And bandwidth, like I said: if you have a write-through cache, this might actually effectively give you more bandwidth if you have a coalescing write buffer.
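As a back-of-the-envelope check on that claim, here is the average memory access time with and without the buffer. All of the numbers are made up for illustration; only the structure of the calculation reflects the point above: hit time and miss rate are untouched, and only the miss penalty shrinks.

```python
# AMAT = hit_time + miss_rate * miss_penalty, with assumed values.

hit_time  = 1      # cycles; unchanged by the write buffer
miss_rate = 0.05   # unchanged: same size, same associativity

penalty_no_buffer   = 40   # fetch waits behind the victim writeback
penalty_with_buffer = 20   # fetch proceeds; writeback drains later

amat = lambda penalty: hit_time + miss_rate * penalty
print(amat(penalty_no_buffer))    # 3.0 cycles
print(amat(penalty_with_buffer))  # 2.0 cycles
```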