Okay. So, an important question that comes up with something like a superscalar, when you're executing multiple instructions at a time, is: what happens when you fetch two instructions and one of them takes an interrupt or exception as it's going down the pipeline?

Let's take an example. Let's say we have a load and then a system-call instruction. Now, both of these instructions can take interrupts or exceptions. The load can take something like a TLB miss or an alignment fault, and the SYSCALL instruction, by definition, makes an interrupt occur. One of the interesting questions is what happens to this load word as it goes down the pipe. We fetch these two at the same time, and they start marching down the pipes. This is our pipeline diagram: we fetch at the same time, we decode at the same time. The load can go to the A-pipe or the B-pipe; here it ends up in B, and the SYSCALL ends up in the A-pipe.

Well, what exactly does it mean if the load is in the B pipeline but takes an interrupt, and it has to commit first, in order? Hm, well, actually, let's think about an even simpler question. What if the load does not take any faults and the SYSCALL takes a fault? Which happened first, A or B?

Okay. So, what should happen first in program order? A load, and then an instruction after the load. The load should happen first, because in program order we go from top to bottom. But the load is in our B-pipe, and in our A-pipe we have an instruction which takes an interrupt.

Well, what happens here is that the load should go down the pipe and complete. So as not to deadlock, your control logic basically has to know about this. Very late in the pipe, you have what we're going to call a commit point, which we'll be talking about later in today's lecture, and there you have to make a rational decision about which of these actually occurred first in program order; you somehow have to track that going down the pipe. Then you have to make a decision: oh, well, the A-pipe just took an interrupt, but in program order the B-pipe holds the first instruction going down the pipe. So, at the end of the pipe, you're going to have to make a logical decision, and you have to have a little bit of control logic to make sure that you don't, say, take the interrupt for the SYSCALL and kill the load instruction that comes before it. One thing you can do is have both of them go down to the end of the pipe, not kill the load but commit it, and have the SYSCALL actually take the interrupt. That's probably the highest-performing thing you can do in this case. Lower-performing approaches would probably be easier, but that's the highest-performing option.
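To make that concrete, here is a minimal sketch of the kind of decision the commit point has to make, assuming a two-way machine where at most one pair of co-issued instructions reaches commit per cycle. The names (Uop, commit_pair, the exception strings) are made up for illustration and are not from any particular design; the point is simply that program order, not pipe assignment, decides whose exception gets taken.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uop:
    name: str                        # e.g. "LW" or "SYSCALL"
    pipe: str                        # "A" or "B": which pipe it was steered to
    exception: Optional[str] = None  # e.g. "TLB_MISS", "SYSCALL_TRAP", or None

def commit_pair(older: Uop, younger: Uop):
    """Decide, at the commit point, what to do with two co-issued uops.
    Returns (committed, killed, taken_exception)."""
    # The older instruction in program order is examined first,
    # regardless of which pipe it happened to travel down.
    if older.exception is not None:
        # Older uop faulted: take its exception, squash the younger one.
        return [], [older, younger], (older.name, older.exception)
    if younger.exception is not None:
        # Older uop is fine: commit it, then take the younger's exception.
        return [older], [younger], (younger.name, younger.exception)
    # Neither faulted: both commit in program order.
    return [older, younger], [], None

# The lecture's example: the load (B-pipe) is older, the SYSCALL (A-pipe) traps.
lw = Uop("LW", pipe="B")
sc = Uop("SYSCALL", pipe="A", exception="SYSCALL_TRAP")
print(commit_pair(lw, sc))   # load commits, then the SYSCALL's trap is taken
```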
Okay. So, we've introduced this two-way superscalar. One thing we need to think about is that we've added a lot more places data could be coming from if we want to forward data. When we had one pipeline, we could bypass out of three places in the diagram. Now that we have two pipelines, you effectively double the number of places you can bypass out of, and you've got six. So, if you pull the steering logic off and make the multiplexers doing your bypassing bigger, you end up with six different locations to choose between for each input operand, and this is a relatively short pipe. As you start to go to bigger and bigger pipelines, either in depth or in width, and you want full bypassing, you're going to have much wider multiplexers and a lot more data being bypassed. So this actually becomes a problem that you need to think about really hard.
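As a rough illustration of why the operand muxes grow, here is a sketch of bypass-source selection for a two-wide pipe. The stage names, the fixed priority order, and the dictionary representation are assumptions made for illustration; real hardware implements this as mux select logic, not a loop, and within a single stage the priority between the A and B slots would depend on program order.

```python
# Pipeline registers that can forward a result, searched youngest-first.
# With one pipe there were three such sources; duplicating the pipe gives six.
BYPASS_SOURCES = [("EX", "A"), ("EX", "B"),
                  ("MEM", "A"), ("MEM", "B"),
                  ("WB", "A"), ("WB", "B")]

def read_operand(src_reg, inflight, regfile):
    """inflight maps (stage, pipe) -> (dest_reg, value) for each slot that
    holds a result; fall back to the register file if nothing matches."""
    for slot in BYPASS_SOURCES:                  # youngest producer wins
        entry = inflight.get(slot)
        if entry is not None and entry[0] == src_reg:
            return entry[1]                      # forwarded value
    return regfile[src_reg]                      # no in-flight producer

# Example: r3 is being produced by an add currently in the A-pipe EX stage.
inflight = {("EX", "A"): ("r3", 42)}
print(read_operand("r3", inflight, {"r3": 0}))   # 42, bypassed
```

Every source operand of every issued instruction needs one such six-way choice, which is why the network blows up as the machine gets wider or deeper.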
So, what are some solutions to this? Well, one solution people sometimes use is not to have full bypassing: you can only bypass at certain locations. That's one option. Another option, which we'll talk about a little later today, is to not have this pipeline register at all, and if you start thinking about out-of-order processors, you can think about committing results back to the register file early. This pipe here has nothing happening in this stage, so couldn't we just shove the result into the register file?

On first appearance that sounds great, but when you start to think about it a little more you can get worried, because write-after-write hazards start showing up as real problems. If you issue an instruction here which writes to the same register as, say, the end of this load operation, you can actually end up writing to the register file out of order. So you need to be cognizant of that.
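Here is a tiny, purely illustrative sketch of that write-after-write problem with early writeback: two instructions that both write r5, where the younger one is allowed to retire its result as soon as it is ready. The instruction names, values, and latencies are invented for the example.

```python
regfile = {"r5": 0}

# Program order: (instruction, dest, value, cycle at which it writes back)
older   = ("LW   r5, 0(r1)", "r5", 111, 4)   # slow load, writes r5
younger = ("ADDI r5, r0, 7", "r5",   7, 2)   # fast ALU op, also writes r5

# If the younger ADDI writes back early, the older load's write lands later
# and clobbers it: writes happen in completion order, not program order.
for instr, dest, value, writeback_cycle in sorted([older, younger],
                                                  key=lambda i: i[3]):
    regfile[dest] = value

print(regfile["r5"])   # 111 -- wrong: in program order, r5 should end up as 7
```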
Another approach people take is what are called clustered superscalars. A clustered superscalar might have, say, four pipelines, clustered into two groups of two. You allow bypassing between the two pipes within a cluster, and if you bypass between the two clusters, it takes an extra cycle, or you have to go through the register file, or something like that. So there are other approaches to try to mitigate the blowup of this bypassing network.

You have to remember that in something like a 64-bit processor, each of these is a 64-bit bus, each one of these little wires here, so these things get pretty big pretty quickly. You have to worry about actually routing these things around the chip, because all of a sudden you have hundreds and hundreds of bits, 600 or so bits of wires, running up and over just for the bypassing of this simple pipeline, and if we go wider, or longer, it's going to be much worse.

So, one thing people do a lot to handle this bypassing, from a critical-path perspective, because it starts to take a long time, is to break the decode and the issue apart. We're going to get away from our five-stage pipes now; up to this point we've been doing things you've seen in the first Patterson and Hennessy book, and now we're going to start thinking about longer pipelines. One good thing to do is to break the decode and the register file access into two separate stages in the pipe, effectively making a six-stage pipeline. And what do we put in each stage? Well, we can break the decode into its own pipe stage and try to figure out structural hazards in that stage. That's where people traditionally do the decode, and they also check whether you're going to have a structural hazard, let's say, on the write port of the register file at the end of the pipe. Then, in the issue stage, I for issue, you do the register file access, and you'll probably swizzle, or cross over, or steer the instructions and the operands to the correct locations. And, of course, you do the bypassing there, if you have lots of bypassed operands coming back.

To give you a brief pipeline example, here we have a pipeline that executes two instructions per cycle, and we can see that it now has an extra I stage in it, which is just an extra front-end stage.

Okay. So, this has some negative aspects. Can anyone think of one negative aspect of putting extra pipeline stages in the front of our pipeline? Yes. So, branches. If the branch gets resolved out of the first execute stage of the pipe, A0 in our pipeline here, we've just increased the branch cost by one. So now, a branch mispredict that would have cost, let's say, two cycles just became three cycles. And this can start hurting your performance, and it really starts to hurt as you go wide.

So, let's take this instruction sequence, where we have this extra issue stage in front and we have a branch as the first instruction. We try to execute, and then we just have fall-through code here, which we predict as fall-through. We don't realize that the branch is taken until A0; at that point, we can redirect and kill everything that we've already put in flight. But look at all the things that have gone in flight already. We're sitting here, which means we've had one, two, three other stages, or three other cycles, to go fetch instructions. So we fetched these instructions, we decoded them, we spent a lot of power, we spent a lot of time, we spent a lot of fetch bandwidth doing this. And then we just kill it all and re-vector to the correct branch target. So, in this example, we've killed seven instructions. This can have a pretty negative impact on our clocks per instruction.
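As a small worked example of that cost, the sketch below just reproduces the counting from the figure: a two-wide machine where the branch resolves in the first execute stage, three cycles after its own fetch group. The function name and the way the count is broken down are my own framing for illustration, not anything from the lecture slides.

```python
def squashed_slots(width, younger_fetch_groups, slots_after_branch_in_own_group):
    """Issue slots killed on a mispredict in a `width`-wide pipeline.

    younger_fetch_groups: fetch groups brought in behind the branch's own
        group before the branch resolves (here decode, issue, execute -> 3).
    slots_after_branch_in_own_group: instructions co-fetched after the
        branch in its own group (here 1)."""
    return width * younger_fetch_groups + slots_after_branch_in_own_group

# Two-wide pipe, branch resolves in A0 after the fetch, decode, issue stages:
print(squashed_slots(width=2, younger_fetch_groups=3,
                     slots_after_branch_in_own_group=1))   # 7 killed
```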
So, let's talk briefly about how to fix this. We're not going to fix all of it today; we have a whole lecture dedicated to it. But what could we possibly do to minimize the probability of all these dead cycles, all these killed instructions going down our two-way pipe? Well, if we're lucky, we can try to predict the branch destination with some accuracy, and have a branch predictor which figures out where the destination of the actual branch is with high probability. Then, instead of executing, say, Op A here, which is a dead instruction down the incorrect branch path, we can try to fetch and execute the correct branch path. We're going to have a whole lecture on how to get your branch prediction accuracy up. In modern-day processors, predictors are somewhere around 98% accurate, give or take a little bit. I actually don't know what the state of the art is, because branch predictors keep getting better and more complex, but there are pretty simple things you can do to get into the mid-80s range, and then getting from the mid-80s of branch prediction accuracy into the 90s takes a lot of effort and time. We'll have a whole lecture later in the course dedicated to branch prediction. But I just want to motivate here that a longer front end of your pipe, and we're going to look at some pipelines with even more front-end stages than just fetch, decode, and issue before the branch gets resolved, is going to impact your performance whenever you add an extra pipe stage at the front. Because even if you have high prediction accuracy, it's not going to be 100%, and when you mispredict, you're going to have dead instructions going down the pipe, wasting time, energy, and utilization of the pipeline.
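To put rough numbers on that intuition, here is a simple back-of-the-envelope CPI model in the spirit of the argument above. The branch fraction, penalty, and accuracy figures are made-up example values, not measurements of any real machine.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, mispredict_penalty):
    # Each mispredicted branch adds `mispredict_penalty` dead cycles.
    return base_cpi + branch_fraction * (1.0 - accuracy) * mispredict_penalty

# Say 20% of instructions are branches, and a mispredict costs 3 cycles
# (the 2-cycle penalty grew to 3 when we added the extra issue stage):
for accuracy in (0.85, 0.98):
    print(accuracy, round(effective_cpi(1.0, 0.20, accuracy, 3), 3))
# ~1.09 CPI at 85% accuracy vs. ~1.012 at 98%, and the gap widens as the
# front end (and therefore the penalty) gets deeper.
```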