And now we're going to move on to talking about basic pipelines. We'll start off by talking about how to build a pipeline, how to name things in a pipeline, how instructions flow down a pipeline, and why you want to build a pipeline in the first place. We'll start with a very gentle introduction by talking about an idealized pipeline, not necessarily in a processor. Pipelines show up in many places in the world. They show up in car factories and toy factories. Some people will say that when you use the washing machines in a laundromat you're pipelining, because you first put your laundry in the washing machine, then take it out and put it in the dryer, and then you can put more laundry in the first washing machine. So we see pipelines in lots of different places in the world, but in this class we're going to care about using pipelines in microprocessors. So let's think about an idealized pipeline. I have a picture here of an idealized pipeline. What are the requisite conditions for an idealized pipeline? Well, first of all, all objects should go through all the stages. Here we have four stages in this pipeline, and there are no squiggly lines where things come out of the pipeline, go around the pipeline, or exit early; that doesn't happen in this diagram. So in an idealized pipeline you want all of the objects to go through all the stages. Second, you don't want any resources shared between different stages. So you don't want some resource here which is shared between stage two and stage three. An example of this in a car assembly line would be one tool which two different stages in the line have to use; in processor design, that would cause what is known as a structural hazard.
Third, the propagation delay of all the different stages should be the same. So the time taken for stage one should be the same as stage two, the same as stage three, the same as stage four in an idealized pipeline. And then, finally, the scheduling of an operation, or a transaction, or an object going down the pipeline should not be affected by what's currently in the pipeline. These conditions actually hold true for most assembly lines in plants where people build cars or something like that: there's no dependence between the different parts, all the cars are separate from each other, so it's not really a problem. Unfortunately, if you look at something like a microprocessor executing instructions, instructions actually depend on earlier instructions, and they depend on them in multiple different ways. They can either depend on the data values, or they can depend on the fact that you took a branch or a control flow instruction. So we have to look at these different hazards, think about how to either solve them or deal with them, and think about how to deal with non-ideal pipelines. So let's go one step beyond our microcoded processor design and look at an unpipelined datapath for the MIPS instruction set. If you have already taken a computer organization class, you've probably seen similar unpipelined datapaths. In an unpipelined datapath, from the program counter here we have some instruction memory where we go and fetch our instruction. This flows through: we fetch our registers from the register file, we do the actual computation, we might have to access data memory if we do a load or a store, and then finally it comes back around. We write the result here, and we increment or change the program counter.
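As a very rough sketch, the unpipelined flow just described can be modeled as one big combinational step per instruction. The toy instruction encoding and register names below are invented for illustration; this is not the actual MIPS datapath, just the shape of it:

```python
# Toy model of a single-cycle (unpipelined) datapath: fetch, register read,
# execute, memory access, and write-back all happen in one "cycle".
# Invented encoding: ("add", rd, rs, rt) and ("lw", rd, rs, offset).
def single_cycle_step(pc, imem, regs, dmem):
    op, rd, a, b = imem[pc]            # instruction fetch + decode
    if op == "add":
        regs[rd] = regs[a] + regs[b]   # register fetch + execute + write-back
    elif op == "lw":
        regs[rd] = dmem[regs[a] + b]   # execute (address) + memory + write-back
    return pc + 1                      # increment the program counter

regs = {"r0": 0, "r1": 3, "r2": 4, "r3": 0, "r4": 0}
dmem = {10: 42}
imem = [("add", "r3", "r1", "r2"),     # r3 = 3 + 4
        ("lw", "r4", "r1", 7)]         # r4 = dmem[3 + 7]
pc = 0
while pc < len(imem):
    pc = single_cycle_step(pc, imem, regs, dmem)
print(regs["r3"], regs["r4"])  # 7 42
```

The point of the sketch is that each call to `single_cycle_step` must finish completely before the next one starts, which is exactly what makes the cycle so long.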
And this is all done in one cycle, and it's a very long cycle, because you launch here, you go through all this work, you do all these things, the data gets put back here, and it all has to happen in one cycle because we have an unpipelined processor. So this is not great from a cycle-time perspective. It is good from the perspective of the number of cycles per instruction: it's always one, because you launch an instruction, you wait until it's done, then you launch the next instruction, and each instruction takes one cycle. So let's simplify this design a little bit so we can talk about what's going on in a little more depth. We're going to simplify our unpipelined processor design and take out some of the extra stuff here: from the program counter you access the instruction memory, all combinationally, you go through the register file and the combinational logic, you go through the data memory, and you come back around. That would be something like a worst-case load instruction, and the value gets written back into the general-purpose register file. So let's now move on to thinking about how we build the same datapath, but try to pipeline it. Well, the first thing we want to do is cut our design up, and we're going to put in registers. It just so happens that this is a convenient place to put registers for something like a MIPS datapath. One important thing you might recall from your digital logic design course is that when you pipeline a circuit, you need to make sure that you cut all the lines crossing the register boundary in this direction. So if we were to cut here, we need to put a register here, here, and here, because there are three sets of wires running from left to right.
This feedback path here doesn't need a pipeline register, because it's flowing back and effectively running into the register file combinationally. And the register file can either be clocked or unclocked, depending on how you actually do this design. So we've pipelined our datapath now. Let's put some names on these stages that we're going to be reusing throughout the class. We're going to call this the fetch phase, or the instruction fetch time; sometimes in this class we'll denote this with an uppercase F, or with IF, for instruction fetch. The next thing we're going to do is decode the instruction, and we're also going to fetch the registers out of the register file. Decoding the instruction is how we take the tightly packed bits in the instruction and blow them up into control wires, and in the register fetch phase we actually fetch from the register file; we'll sometimes denote this with an uppercase D, or sometimes with RF, for register fetch. Next is the execute phase. We're actually doing real work here: we're using the ALU, doing an add, a multiply, maybe a subtract or a shift operation, and we'll denote this with an uppercase X or the letters EX, for execute. In the memory phase we're accessing the data memory, and we'll denote this with M or MEM. And then finally the write-back stage, which we'll typically denote with WB or W, is when we actually go to write the data into the register file; we have a whole pipeline stage dedicated to that. Now, I just picked five different places here to pipeline this design, but this is not accidental.
What we're trying to do is take what was a long time period and cut it up into smaller chunks. So we're trying to reduce our clock period, and we can do this by dividing the execution up into multiple cycles. If we look at the timing, the cycle time is greater than or equal to the maximum of: the time to access the instruction memory, the time to access the register file, the time through the ALU, the time to access the data memory, and the time to write back. Those are the five stages here. And in most pipelines the cycle time is equal to the time it takes to access the data memory. The data memory access in most processors is probably going to be your limiting factor, but people will try to pipeline that even further, and we'll talk later about how to cut the memory access in half. Okay, so that's the easy part: we pipelined the datapath. Let's now move on to the more challenging question of how to pipeline the control. So we're going to look at our control unit and put down our hardwired control here. The hardwired control is going to take the instruction register and decode the instruction that comes in. Hm. Well, that's okay, but there's a problem when you start to look at this. Actually, before we get to that, let's first look at how to use this processor in a multi-cycle yet pipelined fashion. What do I mean by that? Well, let's say we use the same datapath and the exact same control unit we had in the unpipelined MIPS processor. In this design, the cycle time is now, let's say, close to one fifth of the cycle time we had before we added the five pipeline stages.
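The cycle-time relation just described can be put in numbers. The per-stage delays below are made up purely for illustration; only the max-versus-sum relationship matters:

```python
# Hypothetical per-stage propagation delays, in nanoseconds, for the
# five MIPS stages named above. These numbers are invented.
stage_delay_ns = {
    "IF": 0.8,   # instruction memory access
    "RF": 0.5,   # decode + register file read
    "EX": 0.7,   # ALU operation
    "MEM": 1.0,  # data memory access -- typically the limiting stage
    "WB": 0.4,   # register file write-back
}

# Pipelined: the clock period must cover the slowest single stage
# (ignoring pipeline-register setup/clock-to-q overhead).
pipelined_cycle_ns = max(stage_delay_ns.values())

# Unpipelined: the one long cycle must cover the sum of all the stages.
unpipelined_cycle_ns = sum(stage_delay_ns.values())

print(pipelined_cycle_ns)             # 1.0
print(round(unpipelined_cycle_ns, 1)) # 3.4
```

With these invented numbers the pipelined clock period is 3.4 times shorter, which is exactly why cutting the datapath into stages pays off even before we overlap instructions.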
So our processor clock is faster, but we're not going to pipeline our control unit. What this means is we're going to fetch the first instruction and have to wait for it to go all the way to the end, write the register file, and be done before fetching the next one. So what we just built is a multi-cycle yet pipelined processor. This multi-cycle yet pipelined processor has a clock period which is roughly one fifth of that of the unpipelined one, but it now takes five cycles per instruction. So it's a wash. It's probably a little bit worse, because of some overhead from the pipeline registers you need to insert, but it's going to be roughly a wash. So let's say we want to build a pipelined processor that can actually execute multiple instructions at the same time, in different locations in the pipeline. In order to do this, we need to add pipeline registers to our control. By adding pipeline registers to our control wires, we can have multiple instructions flowing down the pipeline, with these instructions in different stages of the pipe. So we can have one instruction here and a different instruction there, and they have different control wires asserted for them, because we've pipelined the control wires. So now we're going to start talking about how to analyze the performance of microprocessors. This is just for performance; there are similar sorts of things for power, area, and other metrics you might care about when optimizing a processor, but in today's lecture we're going to focus on optimizing for performance. There's an idea called the iron law of processor performance. We talked about it a bunch in the first chapter of the Computer Architecture: A Quantitative Approach book, and it boils down to a very basic formula for the time to execute a program.
The time it takes to run the program you're trying to execute is equal to the number of instructions your program has, times the cycles it takes to execute one instruction, multiplied by the time per cycle. As you can see, it's multiplicative, so if you can make any of these numbers smaller, the time for the program goes down, which is a good thing. And you'll also note that they're all equally weighted, so improving any one of them helps. One of the things we should probably do is the dimensional analysis: what is the unit on the time for the program? Well, let's derive it from the three factors on the right side of the equation. Time per cycle: what's the unit on that? It's going to be seconds per cycle, or some time unit, nanoseconds per cycle; let's stick with seconds per cycle. Cycles per instruction: the unit is simply cycles per instruction, so in the dimensional analysis the cycles in the numerator and denominator cancel. And finally we have instructions per program, whose unit is instructions per program: the number of instructions in your program divided by one program. If we multiply this out, the unit we get for time per program is seconds per program, which is indeed the right unit. Okay, so let's think about where these different factors come from and how we can affect these three quantities. Instructions per program: can that be affected? You might say, well, my program takes 100 instructions, and that's what it takes. Well, the world's not that simple. If you had a different instruction set architecture, you might have a different number of instructions.
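The iron law and its dimensional analysis can be written down directly. This is just the formula from the lecture, with made-up example numbers:

```python
def time_per_program(instructions, cpi, seconds_per_cycle):
    """Iron law of processor performance:
    time/program = (instructions/program) * (cycles/instruction) * (seconds/cycle).
    Instructions and cycles each cancel, leaving seconds per program."""
    return instructions * cpi * seconds_per_cycle

# Hypothetical: a billion-instruction program, CPI of 1, a 1 GHz clock
# (one nanosecond per cycle).
print(time_per_program(1_000_000_000, 1.0, 1e-9))  # 1.0 (second)

# Halving any one factor halves the runtime -- the three are equally weighted.
print(time_per_program(500_000_000, 1.0, 1e-9))    # 0.5
print(time_per_program(1_000_000_000, 0.5, 1e-9))  # 0.5
```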
If you have something like the MIPS instruction set, which is a reduced instruction set, you might actually have to execute more instructions for a program than if you were executing on something like an x86 processor, where you have very complex instructions, for example one instruction that can do a whole string copy. On something like MIPS, that would take thousands of instructions. So the ISA really matters. The other thing that can matter here is the compiler. A smarter compiler might be able to reduce or eliminate code which is redundant, or not important at all, or it can try to merge two instructions together. So a smarter compiler can actually affect your instructions per program. The next factor is cycles per instruction. This is dependent on your microarchitecture, as we'll talk about a little more in a second. With a pipelined processor you might take five cycles to complete a full instruction; on something like the microcoded processor we already talked about, you might take ten cycles to execute a given instruction. So there are different numbers of cycles per instruction, dependent on the microarchitecture. It also depends a little bit on the instruction set architecture: if you have a more complex instruction, you might need more cycles to execute it. Something like the string move instruction, or the rep movsb instruction we talked about in the last lecture, will actually turn into multiple microcoded operations on something like an x86 processor. And then, finally, time per cycle. This is dependent on your implementation technology, the transistor and device technology.
If you have faster transistors, that will go down. It also depends on your microarchitecture: if you pipeline deeper, you can decrease the time per cycle. Before we move off this slide, let's compare the three different architectures we've talked about so far, on both clocks per instruction and cycle time. Let me define clocks per instruction here: clocks per instruction (CPI) is how many clock cycles it takes to execute an instruction. The inverse of this, one over CPI, gives you another measure which is very common in computer architecture, called instructions per cycle (IPC). It's just the flip of CPI. Sometimes you'll hear us talk about IPC and sometimes about CPI, and it's really important to know which is which. You want a low CPI and a high IPC, because that means better performance. Okay, let's look at the microarchitectures. We had our microcoded architecture: this took multiple clocks per instruction, but the cycle time is relatively short. We had our single-cycle unpipelined processor, where everything just flows through to the end with no pipeline stages; the clocks per instruction was one, but unfortunately the cycle time is long. We have to do all the work the microcoded machine did over multiple cycles, but in one cycle, so we have to add all those delays together, and since it's all additive we get a long cycle time. And then finally we have a fully pipelined processor, where the control is pipelined and the datapath is pipelined. Here the clocks per instruction comes out as one, and the cycle time is short. And here's a question I want to ask: we talked briefly about the multi-cycle, unpipelined-control processor. This is the processor where you have pipelined the datapath but you have not pipelined the control.
You launch an instruction, it has to go down the pipe to the end before you can fetch the next instruction, and it takes multiple cycles. So the question I have is: what should we put here for CPI, and what is the cycle time relative to the rest of these? I'll let you think about that. And that's right: the clocks per instruction for this example is going to be more than one; for the five-stage pipeline it's going to be five for every instruction, but the cycle time is going to be short. So this is not great when you compare it against the fully pipelined processor, because the clocks per instruction is five versus one and they have the same cycle time. This is going to be five times slower than the fully pipelined processor, on average. Okay, so let's graphically show some more clocks-per-instruction examples. If we look at the microcoded machine, we're going to be executing three instructions here: instruction one, instruction two, instruction three, with time increasing as you go to the right. Let's say the first instruction takes seven cycles to execute, the next instruction takes five, and the third takes ten. If you do the math, you end up with three instructions executing in 22 cycles, or a CPI of 7.33. If you look at the unpipelined processor, we basically have to flow all the data through to the end, and every instruction takes the same amount of time, but the cycle time is long, as I have drawn it. So you have three instructions executing in three cycles: the CPI is one, but the cycle time is high. And in an ideal pipelined machine, we'll actually have multiple instructions executing at the same time, with time going left to right on this graph. We'll be able to overlap the instructions, and in an ideal world we'll have a CPI of one, and our clock period will also be short.
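The CPI arithmetic from the microcoded example above, together with the CPI/IPC relationship defined a moment ago, works out like this:

```python
# Microcoded machine from the example: three instructions taking
# 7, 5, and 10 cycles respectively.
cycles = [7, 5, 10]
cpi = sum(cycles) / len(cycles)  # 22 cycles / 3 instructions
ipc = 1 / cpi                    # IPC is simply the reciprocal of CPI

print(round(cpi, 2))  # 7.33
print(round(ipc, 3))  # 0.136

# The ideal pipelined machine completes one instruction per cycle:
ideal_cpi = 1.0
print(1 / ideal_cpi)  # 1.0 -- CPI and IPC coincide at 1
```

A low CPI means a high IPC and vice versa, which is why confusing the two flips your conclusion about which machine is faster.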
And you can see this takes less total time than those other solutions, and that is why we start to think about pipelining. But we're spending transistors to do this. Okay, so as we go along, some questions that come up are: what are the assumptions we've made about this pipeline, and why are we making them? If the assumptions were to change, we'd actually pipeline differently. One of the assumptions is that you can put something like a cache in your processor and have very fast memory access, because main memory access is so slow relative to the processor's speed; you probably want caches in your processor, and you might want to pipeline the memory access. Another assumption is that you have fast arithmetic logic units to do your math, and that you have multi-port register files, which are maybe slower but you can still build them. If you take these technology assumptions all together, the design is reasonable for a five-stage MIPS pipeline. But it changes whenever device technology changes, and then you have to re-evaluate it. In this class and in previous classes you've focused on five-stage pipelines, and we'll talk a fair amount about five-stage pipelines, but we'll also talk about much deeper pipelines, and about how to go about building those processors. To give you an example from commercial designs, though this is a bit dated now: if you look at something like the Intel Pentium 4, I think there are about 44 stages in that pipeline. So they figured out how to cut up executing one instruction into 44 little pieces. Okay, so let's talk about pipeline diagrams. One of the important things we need now is to think about the nomenclature, and think about how to draw instructions executing down a pipeline.
Because if you want to answer questions about where an instruction is or how an instruction executes, we need some way to draw it graphically. If you don't draw it graphically, it's really hard to reason about multiple things going on at the same time, and that's what we're trying to capture here: we're actually executing multiple instructions at the same time, there's parallelism going on, there are multiple instructions in flight, and there's lots of overlap. The first pipeline diagram we're going to use plots transactions, or instructions, versus time. We put time on the horizontal axis: we have time T0 to T7, increasing. And now we have an instruction flowing down the pipeline. Right now we'll look at an idealized pipeline with idealized instructions; we'll soon be looking at non-ideal pipelines with dependencies between the different instructions. This slide is really important for this class, and if you don't understand it, go back and watch this again, because understanding pipeline diagrams is going to be key to this class, and being able to draw them yourself is going to be very, very important. So let's look at an instruction flowing down the pipeline. We have instruction one: at time zero it's in the instruction fetch stage; at time one it's in the decode stage; at time two it's in the execute stage; at time three it's in the memory access stage; and at time four it's in the write-back stage. Okay, that's basic. Now let's put another instruction on here. It goes through the same set of stages, but shifted by one, because it can't use the fetch resource at the same time; it needs to be fetched in the next cycle. So it's not going to be fetched at time T0, it's going to be fetched at the next time step. And what you'll notice here is that these resources, these pipeline stages, don't have multiple instructions in them at the same time.
They're restricted to only one instruction in each pipeline stage at a time. So we put more instructions on here, and this is what our pipeline diagrams are mostly going to look like in this class: transaction-versus-time, or instruction-versus-time, pipeline diagrams. We plot these out, and you'll see that instructions flow down the pipe. What's really cool about this is that if you draw a vertical slice through the diagram, you can freeze time and see what is in every stage of the pipe. So if we freeze time at T4: instruction five is in the instruction fetch stage, instruction four is in the decode stage, instruction three is in the execute stage, instruction two is in the memory access stage, and instruction one is writing back and finishing. It's really important, when you're drawing these diagrams, that only one instruction is in each stage at a time. If you ever draw something where, let's say, two instructions are both in the write-back stage at the same time, that's a structural hazard: you made a mistake somewhere in your diagram, or you have to stall the processor, or something else has to happen. So make sure that when you slice your time vertically, or look at one time segment, you don't have multiple things in the same stage in that segment. We can also draw a different kind of pipeline diagram. This one instead plots space, or resource, versus time. This is going to be less common in this class, but it's still useful to think about. Here we have time on the horizontal axis and resource on the vertical axis: the fetch, decode, execute, memory access, and write-back stages of the pipe. And we can see that we have different instructions flowing down the pipe.
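The vertical-slice check just described, freezing time and verifying that each stage holds at most one instruction, can be sketched in a few lines. The stage abbreviations follow the naming introduced earlier (IF, D/RF as "ID", EX, MEM, WB):

```python
# Five stages of the ideal pipeline: fetch, decode/register fetch,
# execute, memory access, write-back.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(instr, t):
    """Stage occupied at cycle t by instruction `instr` (0-indexed,
    one instruction issued per cycle) in an ideal pipeline, or None
    if that instruction is not in the pipe at time t."""
    k = t - instr
    return STAGES[k] if 0 <= k < len(STAGES) else None

# Freeze time at T4, as in the example above: instructions 1..5 occupy
# WB, MEM, EX, ID, IF respectively -- five distinct stages, no hazard.
slice_t4 = [stage_at(i, 4) for i in range(5)]
print(slice_t4)  # ['WB', 'MEM', 'EX', 'ID', 'IF']

# The structural-hazard rule: a vertical slice never repeats a stage.
assert len(set(slice_t4)) == len(slice_t4)
```

If you ever compute a slice where the same stage name appears twice, you have drawn a structural hazard and the diagram (or the machine) needs a stall.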
So the instruction fetch stage has instruction one, then instruction two, then instruction three, four, and five flowing down it, and this is sometimes useful. Likewise you can cut vertically here and see, at a certain time, all of the instructions in your pipeline. So it's a useful diagram, but it's a little less useful than the previous one, and when I refer to things as pipeline diagrams, I will mean transaction-versus-time pipeline diagrams. Okay. So now we're going to move off idealized pipelines and talk about non-ideal pipelines, or why pipelines are maybe not ideal. And we're going to introduce the first piece of nomenclature here, called a hazard. A hazard is anything where you have multiple portions of your pipeline interacting in a way that will cause a problem. There are three main types of hazards: structural hazards, data hazards, and control hazards. A structural hazard is when an instruction in the pipeline needs a resource which is being used by a different stage, or some other place, in the pipeline. A data hazard is when you have two instructions that are dependent on each other somehow: one instruction generates some data, and the other instruction needs to use that data. You need to make sure you take the data which is computed by the earlier instruction and somehow move it to the input of the later instruction. Being able to do that is key to actually executing code on a processor, and we call this situation a data hazard. And then finally there are control hazards. Control hazards are another type of dependency, but one which comes from having to change the control flow of the program. An example is something like a branch or jump instruction, which can cause a control hazard. It's a dependency based on some control decision made by an earlier instruction.
So if you have an earlier instruction that branches, the later instructions have to be the ones at the target of the branch, we'll say, and you can't just execute the ones which are the fall-through for the