Let's move on to memory disambiguation. This is kind of the analog of read-after-write hazards through the memory system.

Okay. We have a basic instruction sequence here: a store, followed by a load. When can we execute the load? Well, that's a tough one. If the load's not dependent on the store, we can just go execute out of order. We can potentially even execute the load before the store in an out-of-order machine. If the load is dependent on the store, so if, let's say, register two is equal to register four, then we're going to have a problem.

So, how do we go about trying to solve this? Up until this point in our pipes we've said, okay, loads and stores are in order, because that solves these problems. So let's say we just execute all loads and stores in order, and we can still allow other instructions to move around them. It's out of order relative to other instructions, but we have in-order memory instructions. And that's going to guarantee that we're not going to have any problems.

Well, we're leaving performance on the table here. We could go faster. We could actually execute loads and stores out of order if R2 and R4 are different. But we're probably going to need some extra structure to do that.

Okay. So, let's look at a conservative out-of-order load example. In our conservative version, let's split the store into two sub-parts: an address computation, and the actual data write. Because we do the address computation early, we can know that the subsequent load is not going to alias, that is, we know R4 is not going to be equal to R2, and we can go execute the load before the store in this case. That's what splitting the store into two different parts, the actual store portion and the address computation, buys us. Unfortunately, we still need to check every single load against all previous uncommitted stores, or we need some structure to go do that. And this is still a bit of a problem, because we're not able to execute a load if we don't yet know the addresses of the stores before it; we just don't know whether we can safely go execute it or not.

Okay, well, can we do one better?
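Before we move on to speculating, here's a minimal sketch of that conservative check, just to pin down the rule. The names here (PendingStore, load_may_issue) are made up for illustration; the point is simply that a load can only go early when every older, uncommitted store has a known address that doesn't match.

```python
# A minimal sketch (not any real machine's logic) of the conservative rule:
# a load may issue out of order only if every older, uncommitted store
# already has a known address and none of those addresses match the load.

from typing import Optional, List

class PendingStore:
    def __init__(self, addr: Optional[int]):
        self.addr = addr          # None means the address hasn't been computed yet

def load_may_issue(load_addr: int, older_stores: List[PendingStore]) -> bool:
    for st in older_stores:
        if st.addr is None:       # unknown store address: must conservatively wait
            return False
        if st.addr == load_addr:  # known alias: the load must wait for this store
            return False
    return True                   # no older store can alias; safe to go early

# Example: an older store whose address is still being computed blocks the load.
print(load_may_issue(0x1000, [PendingStore(None)]))     # False
print(load_may_issue(0x1000, [PendingStore(0x2000)]))   # True
print(load_may_issue(0x1000, [PendingStore(0x1000)]))   # False
```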
Let's start speculating. Let's guess that these are not equal and execute the load before the store. Then we hold off committing the load and store instructions, and we do that in order. And if something bad happens, if the registers do equal each other, we have to replay all of these instructions. What's annoying is we also have to kill all of the subsequent instructions. Anything after that load and store which could have picked up a value from the load, any instructions dependent on the load, we're going to have to go kill all of that. And depending on how precise we want to be, we can either kill everything after it, or we can try to selectively kill just the things which are dependent on it, which is really hard to track. So we have a pretty high penalty here for inaccurate address speculation. If we just put these two instructions into the pipe and they turn out to be dependent on each other, we have to roll back a lot of state.

So, one heuristic for this, which was done in the Alpha 21264, is called memory dependence prediction. What you do is you still guess that the loads and the stores do not alias. But later, if you find that the two registers do equal each other, you squash all the instructions, and then you do something special to that load. You basically mark it as "wait for the store"; you sort of put a barrier in, so you don't get multiple rollbacks over and over again. This is a conservative response to seeing that you're rolling back a lot and throwing away a lot of work. Instead, you can just say, well, I think this load is dependent on that store. You can potentially even remember that across executions; if this load is, let's say, always dependent on the store, you could keep that information in your instruction cache or something like that. Then, when you go to execute that same load at a different time, it's going to wait for all the stores to complete; it's going to barrier. This is just a prediction, a heuristic, but it's one way you could potentially not cause that load to always replay.
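Here's a rough sketch of that wait-for-the-store heuristic, in the spirit of the 21264's memory dependence prediction; the real hardware details differ, and the table keyed by load PC is purely illustrative.

```python
# A rough sketch of the "wait for stores" heuristic described above. The table
# below is keyed by load PC and is invented for illustration, not taken from
# the actual Alpha 21264 implementation.

store_wait_bit = {}   # load PC -> True if this load previously misspeculated

def on_misspeculation(load_pc: int) -> None:
    # After squashing the pipe, remember that this load aliased an older store,
    # so we don't keep paying the rollback penalty on future executions.
    store_wait_bit[load_pc] = True

def load_may_issue_early(load_pc: int, older_stores_outstanding: bool) -> bool:
    if store_wait_bit.get(load_pc, False):
        # Trained: treat the load as dependent and barrier behind older stores.
        return not older_stores_outstanding
    # Untrained: speculate that the load does not alias and issue it early.
    return True

# First time through, the load speculates, aliases, and gets squashed...
print(load_may_issue_early(0x40, older_stores_outstanding=True))   # True (speculates)
on_misspeculation(0x40)
# ...next time around it waits for the older stores instead of replaying again.
print(load_may_issue_early(0x40, older_stores_outstanding=True))   # False (waits)
```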
And if you have multiple loads dependent on one store, you could potentially keep those loads from having to replay multiple times. But the big advantage here is that when you go to re-execute that load some time in the future, you're not going to flush the entire pipe. So this is kind of like branch prediction, if you will: I've trained the predictor to say that this load usually is dependent on the previous store, so just wait for all the older stores to clear out of the pipe. It's a cute little trick.

Okay. So, we're almost out of time. All right.

So, speculative loads and stores. We're going to introduce a store buffer to hold the speculative state. Let's take a look very briefly here, I'll flash up what a store buffer looks like, and then we'll think about it for a second.

Here you have your cache, your L1 data cache, we'll say. We're going to send every address that goes to the L1 cache also to this other structure here, which is the speculative store buffer. Inside the speculative store buffer there are going to be bits basically saying whether each entry is valid or not. A load is going to check against this buffer, and if the data hasn't actually gone into the cache yet, it can go pick up the new value here. The S bit is basically going to tell us whether the store is still in flight. It's possible that we've allocated an entry here, but the store isn't at the end of the pipe yet, or we're still calculating the data. So, if a load comes along out of order, we need to know whether it's going to hit here, and we need to go check for that. That's really why we have two bits here.

Then, sometime in the future, we have to eventually move the data into the data cache, because this structure will get full; if we fire enough stores at it, it's going to fill up. So it's probably a good idea to go move the data into the data cache at some point. We can do that at our convenience, though.

And if the store aborts, this is important here, so if the store, let's say, was itself speculative and on the wrong side of a branch or something like that, we just remove it from this table.

One interesting question here is: if the data is in both this buffer and in the cache, which one takes precedence?
So, the store buffer has the newer values; we have to look there. And then, finally, it's possible to get multiple stores in here which have the same tag, the same address. That happens if you're storing to the same address over and over again. And when you go to read, you need to read the youngest value out of there, because that's the most up-to-date one. So this lookup is a little complicated; you actually need to do it in program order.

Okay. So, we'll stop here for today.
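To recap the store buffer behavior we just walked through, here's a small illustrative model: the buffer takes precedence over the cache, and a load forwards from the youngest matching entry. The class and method names are invented for this sketch, and real hardware does the lookup associatively rather than with a loop.

```python
# An illustrative model of the speculative store buffer discussed above. Each
# entry carries a valid (V) bit, a speculative (S) bit, the store address, and
# its data. Entries are kept in program order, oldest first.

class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []   # list of entries, oldest store first

    def allocate(self, addr, data, speculative=True):
        self.entries.append({"V": True, "S": speculative, "addr": addr, "data": data})

    def load(self, addr, cache):
        # The buffer holds the newest values, so it takes precedence over the
        # cache; with multiple matches we want the youngest, so scan in reverse
        # program order.
        for entry in reversed(self.entries):
            if entry["V"] and entry["addr"] == addr:
                return entry["data"]
        return cache.get(addr)          # no match: fall back to the data cache

    def commit_oldest(self, cache):
        # Drain the oldest store into the data cache, at our convenience,
        # before the buffer fills up.
        if self.entries:
            entry = self.entries.pop(0)
            cache[entry["addr"]] = entry["data"]

    def abort_youngest(self):
        # A store on the wrong side of a branch is simply dropped from the table.
        if self.entries:
            self.entries.pop()

# Two stores to the same address: a load sees the youngest one, not the stale
# value sitting in the cache.
cache = {0x100: 1}
ssb = SpeculativeStoreBuffer()
ssb.allocate(0x100, 2)
ssb.allocate(0x100, 3)
print(ssb.load(0x100, cache))   # 3, forwarded from the buffer
```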