And now we're going to move on to talking about basic pipelines. We'll start off by talking about how to build a pipeline, how to name things in a pipeline, how instructions flow down a pipeline, and why you want to build a pipeline at all. We'll begin with a very gentle introduction by talking about an idealized pipeline, not necessarily in a processor. Pipelines show up in many places in the world: car factories, toy factories. Some people will say that when you use the washing machines in a laundromat you're pipelining, because you first put your laundry in the washing machine and take it out, and then you can put it in the dryer while you put more laundry in the first washing machine. So we see pipelines in lots of different places in the world, but in this class we're going to care about using pipelines in microprocessors. So let's think about an idealized pipeline. I have a picture here of an idealized pipeline. What are the requisite conditions for an idealized pipeline? Well, first of all, all objects should go through all the stages.
So here we have four stages in this pipeline, and there are no squiggly lines where things come out of the pipeline, go around the pipeline, or exit early. That doesn't happen in this diagram. In an idealized pipeline, you want all of the objects to go through all the stages. You also don't want any resources shared between different stages; so you don't want some resource here that is shared between stage two and stage three. An example of this in a car assembly line would be one tool that two different stages have to use; in processor design, that would cause what is known as a structural hazard. Next, the propagation delay of all the different stages should be the same: the time taken for stage one should be the same as stage two, the same as stage three, and the same as stage four in an idealized pipeline. And finally, the scheduling of an operation, or a transaction, or an object going down the pipeline should not be affected by what's currently in the pipeline. These conditions actually hold true for most assembly lines in plants where people build cars or something like that. There's no dependence between the different parts. A car comes into it.
All cars are separate from each other, so it's not really a problem. Unfortunately, if you look at something like a microprocessor executing instructions, instructions actually depend on earlier instructions, and they depend on them in multiple different ways. They can either depend on the data values, or they can depend on the fact that you took a branch or a control flow instruction. So we have to look at these different hazards, and we have to think about how to either solve them or deal with them; we have to think about how to deal with non-ideal pipelines. So let's go one step beyond our microcoded processor design and look at an unpipelined datapath for the MIPS instruction set. If you have already taken a computer organization class, you've probably seen similar sorts of unpipelined datapaths. In an unpipelined datapath, we start from the program counter here, and we have some instruction memory where we go and fetch our instruction. This flows through: we fetch our registers from the register file, we do the actual computation, we might have to access data from the data memory if we do a load or a store, and then finally it comes back around.
We write the result here, and we increment or change the program counter. And this is all done in one cycle. It's a very long cycle, because you launch here, you go through all this work, you do all these things, the data gets put back here, and because we have an unpipelined processor, it all has to happen in one cycle. So this is not great from a cycle time perspective. It is good from the perspective of the number of cycles per instruction: it's always one, because you launch an instruction, you wait until it's done, then you launch the next instruction, and each instruction launched takes one cycle. So let's simplify this design a little bit so we can talk about what's going on here in a little more depth. We're going to simplify our unpipelined processor design, take out some of the extra stuff, and focus on the core path: from the program counter you access the instruction memory combinationally, you go through the register file and the combinational logic, you go through the data memory, and you come back around. That would be something like a worst-case load instruction, where the value gets written back here into the general-purpose register file.
So let's now move on to thinking about how we build the same datapath, but try to pipeline it. Well, the first thing we want to do is cut our design up, and we're going to put in registers. It just so happens that this is a convenient place to put registers for something like a MIPS datapath. One important thing you might recall from your digital logic design course is that when you go to pipeline a circuit, you need to make sure that you cut all the lines that are crossing the register boundary in the forward direction. So if we were to cut here, we need to put a register here, here, and here, because there are three sets of wires running from left to right. This feedback path here doesn't need a pipeline register, because it's flowing back and it's effectively running back into the register file here combinationally. And the register file can either be clocked or unclocked, depending on how you actually do this design. So we've pipelined our datapath now, and let's put some names on these things that we're going to be reusing throughout the class. We're going to call this the fetch phase, or the instruction fetch time. In this class we'll either denote this with an uppercase F, or we'll denote it IF, for instruction fetch.
The next thing we're going to do is decode the instruction, and we're also going to fetch the registers out of the register file. The decode is how we take the tightly packed bits in the instruction and blow them up into control wires. In the register fetch phase we're actually going to fetch from the register file; we'll sometimes denote this with an uppercase D, or sometimes with RF, for register fetch. Then this is the execute phase. We're actually doing real work here: we're using the ALU, doing some add, some multiply, maybe a subtract or a shift operation. We'll denote this with an uppercase X, or the letters EX, for execution. In the memory phase we're accessing the data memory; we'll denote this with M, or MEM. And then finally the write-back stage here we'll typically denote with WB or W, to denote when we actually go to write the data into the register file; we have a whole pipeline stage dedicated to that. Now, one important thing: I just picked five different places here to pipeline this design, but this is not accidental.
What we're trying to do is take what was one long time period and cut it up into smaller chunks. We're trying to reduce our clock period, and we can do this by dividing the execution up into multiple cycles. If we look at the timing, the cycle time is greater than or equal to the maximum of: the time to access the instruction memory, the time to access the register file, the time to go through the ALU, the time to access the data memory, and the time to write back. Those are the operations in the five stages here, so the cycle time is bounded below by the maximum of those times. In most pipelines it's probably equal to the time it takes to access the data memory; the data memory access in most processors is probably going to be your limiting factor, but people will try to pipeline this, and we'll talk later about how to pipeline that even more, to try to cut the memory access time in half. Okay, so that's the easy part; we've pipelined the datapath. Let's now move on to a little more challenging question of how to pipeline the control.
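Before turning to control, the clock-period bound just described can be sketched numerically. This is a minimal sketch; the per-stage delays are made up for illustration and are not from the lecture.

```python
def min_clock_period(stage_delays):
    """A pipelined clock period can be no shorter than the slowest stage."""
    return max(stage_delays)

# Hypothetical per-stage delays in nanoseconds (illustrative numbers only).
delays = {"IF": 0.8, "RF": 0.6, "EX": 0.9, "MEM": 1.0, "WB": 0.5}

# Unpipelined: every stage is in series, so the delays add up.
unpipelined_period = sum(delays.values())             # 3.8 ns
# Pipelined: the clock is set by the slowest stage -- here the data memory,
# matching the point that memory access is usually the limiting factor.
pipelined_period = min_clock_period(delays.values())  # 1.0 ns
```

With these made-up numbers, pipelining shrinks the clock from 3.8 ns to 1.0 ns, and the data memory stage is what sets the limit.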
So we're going to look at our control unit, and we're going to put down our hardwired control here. The hardwired control is going to take the instruction register and decode the instruction that comes in. Hm. Well, that's okay, but there are some problems here when you start to look at this. Actually, before we get to that, let's first look at how to use this processor in a multi-cycle yet pipelined fashion. What do I mean by that? Well, let's say we're going to use the same datapath, and use the exact same control unit we had in the unpipelined MIPS processor. In this design, the cycle time is now, let's say, close to one-fifth of the cycle time we had before we added the five pipeline stages, so our processor clocks faster. But we're not going to pipeline our control unit. What this means is that we're going to fetch the first instruction, and we're going to have to wait for that instruction to go all the way to the end, write the register file, and be done. So what we just built here is a multi-cycle yet pipelined processor. This multi-cycle yet pipelined processor has a clock period which is roughly one-fifth that of the unpipelined one, but each instruction now takes five cycles. So it's a wash.
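The "wash" can be checked with one line of arithmetic. The numbers here are illustrative: assume the original long cycle is 5 ns and pipelining splits it evenly into five 1 ns stages.

```python
# Unpipelined single-cycle machine: 1 cycle per instruction at a 5 ns clock.
time_unpipelined = 1 * 5.0

# Multi-cycle yet pipelined datapath with unpipelined control: the clock
# shrinks to one fifth (1 ns), but each instruction must drain all five
# stages before the next fetch, so every instruction takes 5 cycles.
time_multicycle = 5 * (5.0 / 5)

# Same time per instruction either way -- a wash.
assert time_unpipelined == time_multicycle
```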
It's probably a little bit worse because of some overhead from the pipeline registers that you need to insert, but it's going to be roughly a wash. So let's say we want to start building a pipelined processor that can actually execute multiple instructions at the same time, in different locations in the pipeline. In order to do this, we need to go about adding pipeline registers to our control. By adding pipeline registers to our control wires, we can have multiple instructions flowing down the pipeline, with these multiple instructions in different stages of the pipe. So we can have one instruction here and a different instruction here, and they have different control wires asserted on them, because we've pipelined the control wires. So now we're going to start talking about how we go about analyzing the performance of microprocessors. All performance basically boils down to one formula; this is just for performance, and there are similar sorts of formulas for power and area and other metrics you might care about when optimizing a processor, but in today's lecture we're going to focus on optimizing for performance.
Here comes up this idea called the iron law of processor performance. We talked about this a bunch in the first chapter of Computer Architecture: A Quantitative Approach, and it boils down to a very basic formula: the time to execute the program, meaning the time it takes to run the program you're trying to execute, is equal to how many instructions your program has, times the cycles it takes to execute one instruction, multiplied by the time per cycle. As you see, it's multiplicative, so if you can make any of these numbers smaller, the time for the program will go down, which is a good thing. And what you'll also note here is that they're all equally weighted, so improving any of them helps. One thing we should probably do is think about the dimensional analysis, or: what is the unit on the time for the program? Hm. Well, let's derive it from the three things on the right side of this equation. Time per cycle: what's the unit on that? It's going to be seconds per cycle, or some time unit such as nanoseconds per cycle; let's stick to seconds per cycle.
Cycles per instruction: okay, what's the unit on that? Well, the unit actually is cycles per instruction. So if we do the dimensional analysis, the cycles up here and the cycles down there are going to cancel. And then finally we have instructions per program, and the unit on that actually is instructions per program: the number of instructions your program executes. If we multiply this out, the unit we get for time per program is seconds per program, which is exactly the right unit. Okay, so let's think about where these different portions come from and how we can affect these three different quantities. Instructions per program: can that be affected? You might say, well, my program takes 100 instructions, and that's what it takes. Well, the world's not that simple. If you had a different instruction set architecture, you might have different numbers of instructions. If you have something like the MIPS instruction set, which is a reduced instruction set, you might actually have to execute more instructions for a program than if you were executing on something like an x86 processor, where you have very complex instructions, or rather where you have something like one instruction that can do a whole string copy.
On something like MIPS, that would take thousands of instructions. So the ISA really matters. The other thing that can matter here is the compiler. If you had a smarter compiler, the compiler might be able to reduce or eliminate code which is redundant or not important at all, or it can try to merge two instructions together. So a smarter compiler can actually affect your instructions per program. The next thing is cycles per instruction. Well, this is dependent on your microarchitecture, as we'll talk about a little more in a second. For cycles per instruction: if you have a pipelined processor, you might take five cycles to complete a full instruction, while on something like the microcoded processor we already talked about, you might take ten cycles to execute a given instruction. So you can see there are different numbers of cycles per instruction depending on the microarchitecture. It also depends a little bit on the instruction set architecture: if you have a more complex instruction, you might need more cycles to execute it. So, something like the string move instruction, or the rep movsb instruction we talked about in the last lecture.
On x86, that will actually turn into multiple microcoded operations inside the processor. And then, finally, time per cycle. This is dependent on your implementation technology, the transistor and device technology: if you have faster transistors, that number goes down. It also depends on your microarchitecture: if you pipeline deeper, you can decrease the time per cycle. Before we move off this slide, let's compare the three different architectures we've talked about so far, on both the clocks per instruction and the cycle time. And I want to define clocks per instruction here: clocks per instruction, or CPI, is how many clock cycles it takes to execute an instruction. The inverse of this, one over CPI, gives you another unit of measure which is very common in computer architecture, called instructions per cycle, or IPC. It's just the flip of CPI. Sometimes you'll hear us talk about IPC and sometimes about CPI, and it's really important to know which is which: you want a low CPI and a high IPC, because that means better performance. Okay, let's look at the microarchitectures here.
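Before comparing the machines, the iron law from a moment ago can be written down as a quick sketch; all the numbers here are hypothetical, chosen only to show the multiplicative structure.

```python
def time_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Iron law: (instructions/program) * (cycles/instruction) * (seconds/cycle).
    The instruction and cycle units cancel, leaving seconds per program."""
    return instructions * cycles_per_instruction * seconds_per_cycle

# Hypothetical program: 100 instructions, CPI of 1, 1 ns clock -> 100 ns total.
t = time_per_program(100, 1.0, 1e-9)

# The law is multiplicative and the factors are equally weighted:
# halving any one of them halves the total run time.
assert time_per_program(50, 1.0, 1e-9) == t / 2     # better compiler / ISA
assert time_per_program(100, 0.5, 1e-9) == t / 2    # better microarchitecture
assert time_per_program(100, 1.0, 0.5e-9) == t / 2  # better device technology
```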
So we had our microcoded architecture, and this took multiple clocks per instruction, but the cycle time is relatively short. Okay. We had our single-cycle unpipelined processor, where everything just flows through to the end and there are no pipeline stages in it. The clocks per instruction was one, but unfortunately the cycle time is long: we have to do all the work the microcoded machine was spreading over multiple cycles, but we need to do it in one cycle, so all those delays add together and we get a long cycle time. And then finally we have a fully pipelined processor, where the control is pipelined and the datapath is pipelined. Here, the clocks per instruction comes out to one, and the cycle time is short. And here's a question I want to ask. We talked briefly about the multi-cycle unpipelined-control processor: the processor where you have pipelined the datapath but you have not pipelined the control, and you launch an instruction that has to go down the pipe to the end before you can fetch the next instruction, so it takes multiple cycles. The question I have is: what should we put here for CPI, and what is the cycle time relative to the rest of these? I'll let you think about that.
And that's right: the clocks per instruction for this example is going to be more than one; in the five-stage pipeline it's going to be five for every instruction, but the cycle time is going to be short. So this is not great when you compare it against the fully pipelined processor, because the clocks per instruction is five versus one, and they have the same cycle time. This is going to be five times slower than the fully pipelined processor, on average. Okay, so let's graphically show some more clocks-per-instruction examples. If we look at the microcoded machine, we're going to execute three instructions here: instruction one, instruction two, instruction three, with time increasing as you go to the right. Let's say the first instruction takes seven cycles to execute, the next instruction takes five, and the third instruction takes ten. If you do the math, you end up with three instructions executing in 22 cycles, or a CPI of 7.33. If you look at the unpipelined processor, we have to flow all the data through to the end, and every instruction takes the same amount of time here, but the cycle time is long, as I have drawn. So you have three instructions, it takes three cycles, the CPI is one, but the cycle time is high.
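The microcoded numbers above (7, 5, and 10 cycles for three instructions) can be checked directly:

```python
def average_cpi(cycles_per_instruction):
    """Average CPI = total cycles / number of instructions executed."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)

# The lecture's microcoded machine: three instructions taking 7, 5, and 10 cycles.
cycles = [7, 5, 10]
total_cycles = sum(cycles)   # 22 cycles for 3 instructions
cpi = average_cpi(cycles)    # 22 / 3 = 7.333...
```

Rounding to two places gives the 7.33 quoted above.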
And in an ideal pipelined machine, we'll actually have multiple instructions executing at the same time. Time goes left to right on this graph. We'll be able to overlap the instructions, and in an ideal world we'll have a CPI of one, and our clock period will also be short. You can see this takes less total time than those other solutions, and this is why we start to think about pipelining; but we're using transistors to do this. Okay, so as we go along here, some of the questions that come up are: what are the assumptions we made about this pipeline, and why are we making these assumptions? If the assumptions were to change, we'd actually pipeline differently. One assumption is that you can put something like a cache in your processor and have very fast memory access, because main memory access is slow relative to the processor's speed; you probably want to put caches in your processor, and you might want to pipeline the memory. Another assumption is that you have fast arithmetic logic units to do your math, and that you have multi-port register files, which are maybe slower but you can still build them. If you take these technology assumptions all together, this design is reasonable for a five-stage MIPS pipeline.
But it changes whenever device technology changes, and you have to reevaluate this. In this class, and in previous classes, you've focused on five-stage pipelines, and we'll talk a fair amount about five-stage pipelines, but we'll also talk about much deeper pipelines and how to go about building those processors. To give you an example from commercial designs, and this is outdated: if you look at something like the Intel Pentium 4, I think there are about 44 stages in that pipeline. So they figured out how to cut up executing one instruction into 44 little pieces. Okay. So let's talk about pipeline diagrams. One of the important things we need now is to think about the nomenclature, and think about how to draw instructions executing down a pipeline. If you want to answer questions about where an instruction is or how an instruction executes, we need some way to draw it graphically, because if you don't draw it graphically, it's really hard to reason about multiple things going on at the same time. And this is what we're going to try to solve here: we're actually executing multiple instructions at the same time, there's parallelism going on, there are multiple instructions in flight, and you have lots of overlapping.
320 00:24:31,070 --> 00:24:39,001 So the first pipeline diagram we're going to use here will plot transactions, or 321 00:24:39,001 --> 00:24:44,075 instructions, versus time. So we're going to put time on the 322 00:24:44,075 --> 00:24:49,029 horizontal axis here. So we have time T0 to T7, and it's 323 00:24:49,029 --> 00:24:55,061 increasing. And now we're going to have 324 00:24:55,061 --> 00:25:01,045 an instruction flowing down the pipeline. And right now we'll look at an idealized 325 00:25:01,045 --> 00:25:05,070 pipeline with idealized instructions. We'll soon be looking at non-ideal 326 00:25:05,070 --> 00:25:09,070 pipelines with dependencies between the different instructions. 327 00:25:09,070 --> 00:25:14,071 And this slide is really important for this class, and if you don't understand this, go 328 00:25:14,071 --> 00:25:18,031 back and watch this again. Because understanding pipeline 329 00:25:18,031 --> 00:25:23,038 diagrams is gonna be key to this class, and being able to draw these yourself is going 330 00:25:23,038 --> 00:25:27,033 to be very, very important. So let's look at an instruction 331 00:25:27,033 --> 00:25:31,089 flowing down the pipeline here. So, we have instruction one: at time zero, 332 00:25:31,097 --> 00:25:35,006 it's in the instruction fetch stage, at time 333 00:25:35,006 --> 00:25:38,054 one, it's in the decode stage, at time two, it's in the execute stage, at time 334 00:25:38,054 --> 00:25:42,016 three, it's in the memory access stage, and at time four, it's in the write-back 335 00:25:42,016 --> 00:25:45,007 stage. Okay, that's basic. 336 00:25:45,007 --> 00:25:52,062 Now let's put another instruction on here. It goes through the same set of stages, but 337 00:25:52,062 --> 00:25:56,059 it's shifted by one, because it can't go and use this resource. 338 00:25:56,059 --> 00:26:02,000 It needs to be fetched in the next cycle.
So it's not gonna be fetched in cycle T0 339 00:26:02,000 --> 00:26:05,000 or time T0; it's gonna be fetched in the next cycle. 340 00:26:05,000 --> 00:26:10,008 And what you'll notice here is these resources, or these pipeline stages, don't 341 00:26:10,008 --> 00:26:13,047 have multiple instructions in them at the same time. 342 00:26:13,047 --> 00:26:17,064 They're restricted to only one instruction in each pipeline stage at a time. 343 00:26:18,042 --> 00:26:23,013 So we put more instructions on here. And this is what our pipeline diagrams are 344 00:26:23,013 --> 00:26:27,061 mostly going to look like in this class, because we're going to have transaction, or 345 00:26:27,079 --> 00:26:32,236 instruction, versus time pipeline diagrams. And we're going to plot these out, and 346 00:26:32,236 --> 00:26:35,092 you'll see that instructions sort of flow down the pipe. 347 00:26:36,050 --> 00:26:41,650 And what's really cool about this is, if you draw a vertical slice, or you 348 00:26:41,650 --> 00:26:46,212 draw one slice through here, you can freeze time and see what is in every 349 00:26:46,212 --> 00:26:50,016 stage of the pipe. So if we freeze time here, at T4, instruction 350 00:26:50,016 --> 00:26:55,028 five is in the instruction fetch stage, instruction four is in the decode stage, 351 00:26:55,028 --> 00:27:00,046 instruction three is in the execute stage, two is in the memory access stage, and 352 00:27:00,046 --> 00:27:03,096 then instruction one is writing back and finishing. 353 00:27:04,070 --> 00:27:09,053 And it's really important that when you're drawing these diagrams, only one 354 00:27:09,053 --> 00:27:14,025 instruction is in each stage at a time. If you ever draw something where, 355 00:27:14,025 --> 00:27:18,080 let's say, one instruction is in the write-back stage and another instruction is 356 00:27:18,080 --> 00:27:22,048 also in the write-back stage, that's a hazard, a structural hazard.
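The transaction-versus-time diagram described above can be generated with a short script. This is an illustrative sketch for an ideal five-stage pipeline (IF, ID, EX, MEM, WB); the stage names and the `stage_at` helper are assumptions for the example, not part of any real toolchain.

```python
# Sketch of a transaction-versus-time pipeline diagram: each row is an
# instruction, each column a cycle; instruction i enters fetch at cycle i.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    rows = []
    for i in range(n_instructions):
        # Pad with blanks until this instruction's fetch cycle, then the stages.
        row = ["  . "] * i + [f"{s:>4}" for s in STAGES]
        rows.append(row)
    return rows

def stage_at(rows, instr, cycle):
    # Which stage instruction `instr` occupies at time `cycle` (or "" if none).
    row = rows[instr]
    return row[cycle].strip() if cycle < len(row) else ""

rows = pipeline_diagram(5)
for i, row in enumerate(rows, 1):
    print(f"I{i}:", "".join(row))

# Freeze time at T4: every stage holds exactly one instruction, so there is
# no structural hazard in this ideal diagram.
print([stage_at(rows, i, 4) for i in range(5)])
```

Slicing the table vertically at T4 gives `['WB', 'MEM', 'EX', 'ID', 'IF']` for instructions one through five, matching the freeze-frame walked through in the transcript.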
357 00:27:22,048 --> 00:27:26,097 You made a mistake somewhere in your diagram, or you had to stall the processor, 358 00:27:26,097 --> 00:27:31,010 or something else has to happen. So make sure that when you slice your 359 00:27:31,010 --> 00:27:35,065 time vertically here, or look at one time segment, you don't have multiple 360 00:27:35,065 --> 00:27:38,027 things in that time segment at the same time. 361 00:27:39,027 --> 00:27:43,048 We can also draw a different pipeline diagram. 362 00:27:43,048 --> 00:27:50,080 So this pipeline diagram is instead going to plot space versus time, or resource 363 00:27:50,080 --> 00:27:54,649 versus time. And this is going to be less common in 364 00:27:54,649 --> 00:27:58,041 this class, but it's still useful to think about. 365 00:27:58,041 --> 00:28:01,092 So, what we have here is we have time on the horizontal axis. 366 00:28:01,092 --> 00:28:04,042 And we have resource on the vertical axis. 367 00:28:04,042 --> 00:28:07,053 So we have here the fetch, decode, execute, memory access, 368 00:28:07,053 --> 00:28:11,084 and then write-back stages of the pipe. And we can see that we have different 369 00:28:11,084 --> 00:28:16,014 instructions flowing down the pipe. So instruction fetch has instruction one, 370 00:28:16,014 --> 00:28:20,078 then instruction two, and then instruction three, instruction four, instruction five 371 00:28:20,078 --> 00:28:23,082 flowing down it, and this is sometimes useful. 372 00:28:24,002 --> 00:28:29,002 Likewise, you can cut vertically here, and see at a certain time all of the 373 00:28:29,002 --> 00:28:33,007 instructions in your pipeline. So it's a useful diagram. 374 00:28:33,007 --> 00:28:36,085 But it's a little less useful than the previous one. 375 00:28:36,085 --> 00:28:41,078 And when I refer to things as pipeline diagrams, I will be focusing on 376 00:28:41,078 --> 00:28:50,034 transaction versus time pipeline diagrams. Okay.
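The space-versus-time view described above is just the transpose of the transaction-versus-time table: one row per pipeline stage, one column per cycle. A minimal sketch, again with assumed stage names and an ideal five-instruction stream:

```python
# Sketch of a resource-versus-time pipeline diagram: rows are stages,
# columns are cycles, and instruction i+1 reaches stage k at cycle i+k.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_diagram(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    table = {s: ["  . "] * n_cycles for s in STAGES}
    for i in range(n_instructions):
        for k, s in enumerate(STAGES):
            table[s][i + k] = f"  I{i + 1}"
    return table

table = resource_diagram(5)
for stage, cells in table.items():
    print(f"{stage:>3}:", "".join(cells))
# The IF row reads I1, I2, I3, I4, I5: instructions flow down each resource,
# and a vertical cut again shows everything in the pipe at one instant.
```

Each row shows one resource handling a stream of instructions, which is why this view is handy for spotting a resource that two stages would have to share.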
377 00:28:50,034 --> 00:28:56,088 So now we're going to move off idealized pipelines, and talk about non-ideal 378 00:28:56,088 --> 00:29:00,092 pipelines. Or why pipelines are maybe not ideal. 379 00:29:03,081 --> 00:29:07,923 Then we're going to introduce the first piece of nomenclature here, called 380 00:29:07,923 --> 00:29:13,011 a hazard. So, a hazard is 381 00:29:13,057 --> 00:29:18,046 anything where you have multiple portions of your pipeline interacting in a way that 382 00:29:18,046 --> 00:29:22,013 will cause a problem. So there are three main types of 383 00:29:22,013 --> 00:29:24,089 hazards. We have structural hazards, data hazards, 384 00:29:24,089 --> 00:29:28,045 and control hazards. A structural hazard is when an instruction 385 00:29:28,045 --> 00:29:33,005 in the pipeline needs a resource which is used by a different stage in the 386 00:29:33,005 --> 00:29:35,058 pipeline, or some place else in the pipeline. 387 00:29:35,058 --> 00:29:40,000 A data hazard is when you have two instructions, and the two instructions are 388 00:29:40,000 --> 00:29:44,002 dependent on each other somehow. So one instruction generates some 389 00:29:44,002 --> 00:29:47,013 data, and the other instruction needs to use that data. 390 00:29:47,013 --> 00:29:52,020 Well, that's called a data hazard. You need to make sure you get the data 391 00:29:52,020 --> 00:29:57,087 which is computed by the earlier instruction and then somehow move it to the 392 00:29:57,087 --> 00:30:02,009 input of a later instruction. And that's kind of key to actually 393 00:30:02,009 --> 00:30:07,043 executing code on a processor, being able to do that, and we call this a data 394 00:30:07,043 --> 00:30:10,060 hazard. And then finally there are control hazards.
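The data hazard just described (one instruction produces a value a later instruction reads) can be sketched as a dependence check. This is a toy example: the tuple encoding of instructions and the register names are made up for illustration, not a real ISA.

```python
# Minimal sketch of detecting a read-after-write (RAW) data hazard between
# an earlier (producer) and a later (consumer) instruction. Each instruction
# is a made-up tuple: (destination register, tuple of source registers).

def raw_hazard(producer, consumer):
    # Hazard if the later instruction reads a register the earlier one writes.
    dest, _ = producer
    _, sources = consumer
    return dest in sources

add = ("r1", ("r2", "r3"))   # add r1, r2, r3  -- writes r1
sub = ("r4", ("r1", "r5"))   # sub r4, r1, r5  -- reads r1
xor = ("r6", ("r7", "r8"))   # xor r6, r7, r8  -- independent of both

print(raw_hazard(add, sub))  # True: sub needs the r1 that add produces
print(raw_hazard(add, xor))  # False: no shared register, no data hazard
```

A real pipeline would use a check like this per stage to decide when to stall or forward; here it only shows what "dependent on each other" means for a data hazard.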
395 00:30:10,060 --> 00:30:15,054 Control hazards are another type of dependency, but it's a dependency which 396 00:30:15,054 --> 00:30:19,019 comes from having to change the control flow of the program. 397 00:30:19,019 --> 00:30:23,048 So an example of this is that something like a branch or jump instruction can cause a 398 00:30:23,048 --> 00:30:26,055 control hazard. So it's a dependency based on some control 399 00:30:26,055 --> 00:30:30,094 decision made by an earlier instruction. So if you have an earlier instruction that 400 00:30:30,094 --> 00:30:35,018 branches, the later instructions have to be the ones at the target of the branch, 401 00:30:35,018 --> 00:30:38,725 we'll say, and you can't just execute the ones which are the fall-through for the