Let's start off our third topic in our review of computer architecture. This is still review at this point. As I said at the beginning of this class, everything up to this point is a prerequisite for this course. We're just going to blow through everything you should know in order to start the much more detailed content later in the course, where we talk about real processors you're going to be building: things like out-of-order, multi-issue, multicore microprocessors, versus the simple little processors you should have built in previous classes. So, this is ELE 475 at Princeton University. We're talking about computer architecture, and today we're going to be reviewing caches. This is the last review topic before we move on to new material, and the new material we're going to be moving onto very soon is superscalars, processors which can execute multiple instructions per cycle.

So let's take a look at the agenda for the cache review lecture. We're going to start off by talking about memory technology and use that as a motivation for caches. We'll look at different types of technology, DRAM versus SRAM: what the transistor technologies are behind them, why we have these different memory technologies, and why we don't just have one. Why are there different ways to store information? Why don't we just use register files for everything, or simple flip-flops for everything? Then we're going to motivate cache design, or why we have caches. Let's define what a cache is: a cache is a little piece of memory that you stick next to something which does not have all of the information that you're trying to store. Instead it has a subset of the information, and hopefully it's a useful subset. We're then going to talk about classification of these different caches.
So we'll put names and labels on different cache architectures. We're going to talk about associativity of caches, we're going to talk about size of caches, we're going to talk about whether the cache has multiple things that can fit in one set or not, and we'll talk a bunch more about that. And then we'll have a very brief introduction to cache performance, or some of the things we should know about why a cache is going to give you good performance. Later in this course, we're going to have two more lectures about advanced caches, where we'll cover more sophisticated topics: how do you build and implement actual caches for very high performance processors?

So let's start talking about memory technology. Let's look at a memory, or something that can store multiple values and give you back some data. Here we have a very naive, flip-flop based register file. You'd probably never actually build this in a modern microprocessor, but let's look at the conceptual idea of what a memory really is. Here we have four entries that are flip-flops. Maybe it's multiple bits wide; let's say each of these registers is 32 bits wide and each of these buses is 32 bits wide. And we're going to select some value based on a read address. This is two bits here, so we can choose one of four places. We have write data that comes in, and we just broadcast that to all of the registers. And we write depending on the write address here and the clock. If the decoder tells us to light this up and the clock clocks the actual element, it will load in the value. You may even have a write enable on something like this, which would also feed, let's say, into the AND gates here, so that if a write is not occurring, we don't actually load anything into these registers. Okay, so that's really, really naive.
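The behavior just described can be captured in a toy Python model: the read address drives a multiplexer selecting one of four registers, and write data is broadcast to all registers but only loaded where the decoder and the write enable agree. This is just an illustrative sketch of the logical behavior, not a hardware description:

```python
class RegisterFile:
    """Toy model of the naive 4-entry, 32-bit flip-flop register file."""

    def __init__(self, entries=4, width=32):
        self.width = width
        self.regs = [0] * entries

    def read(self, raddr):
        # The read address drives a 4:1 multiplexer choosing one register.
        return self.regs[raddr]

    def write(self, waddr, wdata, wen=True):
        # Write data is broadcast to every register; the decoder output
        # ANDed with the write enable decides which flip-flop loads it.
        if wen:
            self.regs[waddr] = wdata & ((1 << self.width) - 1)

rf = RegisterFile()
rf.write(2, 0xDEADBEEF)
rf.write(3, 0x1234, wen=False)  # write enable low: nothing is loaded
```

After this runs, entry 2 holds the written value and entry 3 is still zero, since its write was gated off by the write enable.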
Let's take something a little bit more complex and see how these things actually start to look. Now we're going to transition from this naive register file and see what people actually build. The first thing you should realize is that if you go to build the naive register file, you get lots and lots of bits as you add more entries. Here we only have four, but if we go to a thousand-bit register file, this multiplexer gets really big, and if you stack these all up a thousand long, your aspect ratio, your x versus y size when you go to lay it down on a piece of silicon, is very long and very narrow. You end up, let's say, with 1000 bits, one bit tall. That does not lay out very well, and it's not good from a speed or an area perspective. Why is it not good for speed? Well, because wires have cost and take time to propagate. You have to propagate from 1000 bit cells away all the way over to the multiplexer, so your cycle time is not going to be very fast, versus if you somehow make it more square or more rectangular, the distances are minimized. So this is how we end up with arrays, or memory arrays.

We're going to look at a memory array for a register file first. Let's look at a register file array and how you could actually go build this, because it's a little bit different. We're not going to use fully complementary logic here for everything. Instead we're going to start to transition to something where we have more analog circuits connecting our bits to their respective buses. And we're also going to change how we do the decode. We're not going to have a full multiplexer here, or a full decoder here. We're going to split that into different portions. So here we have an example array. It's a square, and each of these little boxes is a single storage element, a bit cell.
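A rough back-of-envelope calculation shows why the square layout wins on wire length. In a one-bit-tall strip, the worst-case wire to the multiplexer is on the order of the full cell count; in a square array it's on the order of two side lengths. The factor of two and the cell-pitch units here are my own simplifying assumptions, just to illustrate the scaling:

```python
import math

def worst_case_wire(n_cells, square=False):
    """Worst-case distance (in cell pitches) from a bit cell to the output.

    Simplified model: a 1-D strip pays ~n_cells; a square array pays
    roughly one side plus one side of a sqrt(n) x sqrt(n) grid.
    """
    if square:
        side = math.isqrt(n_cells)  # assumes n_cells is a perfect square
        return 2 * side
    return n_cells

# 4096 one-bit cells: a 1 x 4096 strip versus a 64 x 64 square array.
strip  = worst_case_wire(4096)               # 4096 cell pitches
square = worst_case_wire(4096, square=True)  # 128 cell pitches
```

The square layout cuts the worst-case wire by a factor of 32 in this made-up example, which is the whole motivation for laying memory out as a 2-D array.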
The address comes in here and goes up into a row decoder, which turns on one of these wires at a time; these are called word lines. And we'll have both read word lines and write word lines. This is still just a register file. But we're actually going to split out some of the bits of our address and send them over here, to the column decoder. The column decoder is just a multiplexer, very similar to this multiplexer, but it only gets a subset of the address bits. Instead of having all of the bits go through one big multiplexer with an N-bit address coming into it, it has just a subset of that. So in this diagram here, we have a four-bit-wide readout, or write, and we have one, two, three, four, five, six, seven, eight bits across here. So we're going to need a two-to-one column decode; this is a one-bit decode here. But this is just a toy example. As we go to build thousand-bit arrays, or megabyte arrays, or gigabit arrays, we're going to have lots of bits coming in here and lots of bits there. The idea to get across is that you split off a portion of the address for the word-line creation and a portion of the address for your column decode.

For this register file, I wanted to walk through what this little circuit diagram over here is. Here we have two cross-coupled inverters. And as you'll see, we don't have full complementary logic connecting these two things, but this is a stable storage cell: if you give this power, it'll store data. If we want to do a read, we need to connect the output of this to the read bit lines, and these read bit lines are the vertical wires running here. We're going to do this with, effectively, a pass gate here.
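The address split described above, with the upper bits going to the row decoder for word-line selection and the lower bits going to the column multiplexer, can be sketched in a couple of lines. The bit widths here are parameters, not anything fixed by the lecture's toy array:

```python
def split_address(addr, col_bits):
    """Split a flat bit-cell address into (row, column) decoder inputs.

    Upper bits select the word line via the row decoder; the low
    col_bits bits drive the column multiplexer.
    """
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col

# With a one-bit column decode (like the 2:1 toy example in the diagram):
row, col = split_address(0b1011, col_bits=1)  # row 0b101, column 1
```

For a bigger array you'd simply hand more bits to each side, say `split_address(addr, col_bits=3)` for an eight-way column multiplexer.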
When we energize the read word line, it connects the output of this gate, it's going to be this inverter here, to the read bit line. And if we want to do a write, we're going to turn on not the read word line, RWL here, but the write word line, WWL. What that's going to do is connect both Q and Q-bar to the write bit lines. And now we get into some analog magic here. In order to make this work, we're going to drive or ground the write bit line so that it's stronger than what either of these inverters can drive. So we overpower the inverter, and we can flip it, let's say from a zero to a one, or a one to a zero. This is not traditional complementary logic, but this is how we typically build register files: small little arrays that are close in to a processor.

Now let's move to something a little bit larger. We're going to look at memory arrays, things like SRAMs. So we go from register files, something we'd hold our general-purpose registers for a processor in, and we move out a little bit to talk about small arrays, something like caches, maybe a kilobyte of data; these are typically built out of SRAMs. It's the same structure, but we're going to change the cell a little bit. The main difference is that in the cell we're going to have two bit lines, bit and bit-bar, and then two pass gates here, with cross-coupled inverters in the middle. We also have this word line running the other direction, which connects the cross-coupled inverters to the bit lines.
With these SRAM arrays, because you have bit and bit-bar, you can make the cells even weaker than normal, and when you go to build something like an SRAM, down here, in addition to the column decode, you're also going to have things called sense amps. These are basically operational amplifiers: you hook bit and bit-bar to them, and they can sense a very small difference between the two. By doing this, you can effectively have a very small voltage difference between bit and bit-bar and still be able to sense it on the other side. One of the distinctions I wanted to make between a register file and this SRAM is that register files are often designed to be multi-ported, but in this diagram the bit lines get reused; they are both read and write bit lines. So we have effectively built a single-ported SRAM. Having said that, people do build multi-ported SRAMs and single-ported register files, but conventionally you build a register file when you need speed and lots of ports, and you build an SRAM when you want denser storage.

Okay, so now let's move to a piece of technology which is in all of your computers. Register files and SRAMs are also in your computers, but this is something you can actually see, because usually the SRAMs and register files are integrated onto your central microprocessor, so you don't get to go look at them. But here we have a stick of DRAM. Let's see what this is. This is PC100 DRAM, and this is old stuff: 128 megabytes of RAM. Nowadays your computer has more; this laptop here has four gigabytes of DRAM, and different people have different amounts. With DRAM, in contrast with SRAM, you still go and build an array out of it, but the actual storage looks very different. The bit cell, instead of being some form of cross-coupled inverters, is different.
Instead, you're going to have one transistor which hooks a capacitor to your bit line. Now you may ask yourself, how do you build a capacitor? A capacitor typically needs two plates, two pieces of metal with some dielectric in the middle, and it can store charge. Well, what's inside of that RAM looks very odd. It is a capacitor, but it's a very oddly shaped capacitor. Typically what happens is you build these very, very deep, long and skinny trenches. You want them skinny because you want to put them very close together; if you want gigabytes and gigabytes of RAM, the smaller you make each cell, the more you can shove on a single piece of silicon. So you have these really long and narrow trenches here, with two plates of metal and a dielectric in the middle. In this case here, I think there are two metal plates and some dielectric shoved in between, but you can't really see it in this picture. And then all the actual logic is up here. So in this diagram, this is the transistor. And [inaudible], and here's the depletion region, here's our word line. But basically, that's this transistor here. So we're connecting the capacitor, which has a very funny aspect ratio, very tall, to this bit line. And to give you some idea, this is a slice through the silicon wafer. This is not a plan view or top view; in a top view you would just see what looks like a transistor and some poly plug running vertically, and it would look really small. But this is a slice, and the reason we show it this way is so you can see that the actual capacitor goes very deep. What I want to point out is that this is typically really hard to build on your standard CMOS process, on a logic process. You have to build this, let's say, in a special DRAM-only manufacturing process.
So it's sometimes hard to mix the two. There are some technologies which allow you to mix them, but when you do, people haven't really found ways to make the cells as small. But DRAM cells are small. Okay, so what are the advantages of DRAM? Why are we even talking about DRAM? Well, it's a lot easier in DRAM to have large amounts of storage; you can fit big amounts of storage in the same area. Because instead of having, in SRAM, six or eight transistors per cell, we now have one transistor and one capacitor. So it's actually less. Now, how does this circuit work? Well, logically, what we're going to do is store data, store charge, in the capacitor. We connect the capacitor to the bit line, put a one or a zero on the bit line, and then disconnect it, and it'll hopefully hold the charge. At some point in the future, we connect it again, discharge the capacitor into the bit line, and read out the value. And we still need something very sensitive at the bottom here to read out this bit, because the capacitance of one of these little capacitors is very small, so there's not a whole lot of charge in the circuit.

Okay, so what are the problems with this? Well, one of the major problems is that you're going to end up charging and discharging this capacitor, and capacitors, as you may recall, don't always store their charge that well. They might slowly lose that charge over time. What this turns out to mean is that you're actually going to have to refresh the DRAM. You might have heard of DRAM refresh. Typically, in a modern-day computer, your DRAM will only hold the data for maybe a few seconds. It used to be that it only held it for a few milliseconds.
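The refresh requirement can be illustrated with a toy model of a one-transistor, one-capacitor cell: the stored charge leaks away each time step, and a refresh is just a read followed by writing the sensed value back before it decays past what the sense amp can resolve. The leak rate and threshold numbers here are made up purely for illustration:

```python
class DRAMCell:
    """Toy 1T1C DRAM cell: charge leaks over time and must be refreshed."""

    LEAK = 0.7        # fraction of charge surviving each time step (made up)
    THRESHOLD = 0.5   # below this, a stored '1' is no longer distinguishable

    def __init__(self):
        self.charge = 0.0

    def write(self, bit):
        self.charge = 1.0 if bit else 0.0

    def tick(self):
        self.charge *= self.LEAK  # the capacitor slowly loses its charge

    def read(self):
        bit = self.charge > self.THRESHOLD
        self.write(bit)  # write the sensed value back after reading
        return bit

    def refresh(self):
        self.read()  # a refresh is just a read plus the write-back

cell = DRAMCell()
cell.write(1)
cell.tick()
ok = cell.read()          # refreshed in time: still reads as 1
cell.tick(); cell.tick(); cell.tick()
lost = cell.read()        # left too long: the 1 has leaked away
```

After one tick the charge is still above threshold, so the read restores it; after three more unrefreshed ticks it has decayed below threshold and the bit is lost, which is exactly why real DRAM controllers sweep through every row on a fixed refresh schedule.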
Most DRAM is actually decent now, and you've probably seen some attacks that people have built around this. There are some encryption attacks where people will effectively turn off a computer, pull the DRAM out, and stick it in a different computer. And the DRAM will still hold the charge, still hold the information, even after you remove the power, because it has a bunch of little capacitors that store that charge. That's a funny little case with DRAM, and it ends up being a negative. But we're really doing all this to have more space to store data, because each of these bit cells is a lot smaller.

Okay, so I like this diagram; it's one of the key diagrams in today's lecture. It shows the relative sizes of SRAM versus DRAM cells in different technologies. Let's start off here with an SRAM cell built out of a logic process, logic CMOS technology. That's this one here. This looks to be six transistors, a sort of optimized SRAM on a logic process, and it's pretty big. Let's contrast that with this one over here: DRAM on a memory-specific process, and it's tiny. So it's not just one transistor versus six transistors; it's actually more than six times smaller, because they can optimize it. They can go into the Z dimension here, in and out of the board, because the trench capacitor goes down into the substrate. Some of the other interesting things on this diagram: here we have DRAM on a traditional ASIC process, which is here. And here we have a six-transistor cell with local interconnect, which is a little bit smaller. What local interconnect means is that you can use the poly layer of your process, the polysilicon layer, to do interconnections, so you don't have to use metal wires for everything.
So the layout gets a little bit denser. And I really like this bottom one here. The bottom one, labeled A, is not four different things. Instead, it's a fully complementary logic storage cell built out of gates, some number of gates put together. What this is trying to get across is that having custom storage cells in your library is really important, because otherwise your RAM is going to be a lot larger, or you won't be able to fit as much memory on your machine.

Okay, so to wrap this memory technology section up, I want to talk about some of the tradeoffs, because computer architecture is all about tradeoffs. Why would we use one type of technology versus another? What are our tradeoffs here? We can go from fast, close, small things, things like latches and registers, and put them together into bigger structures, like a register file, then an SRAM, and even move to different technologies. As we get into bigger and bigger memories, we get a lot more capacity, but it takes longer to access them, kind of by definition, and typically we have less bandwidth. If we have small things, we have low capacity, low latency, and very high bandwidth. So it's a tradeoff of capacity versus the other positive aspects, and depending on where you put a memory in your memory system, you might want to trade these off.
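The capacity-versus-latency tradeoff just described can be summarized in a small table. The specific numbers here are illustrative order-of-magnitude guesses, not measurements from any particular machine; the point is only the monotonic trend, that as capacity grows, access latency grows with it:

```python
# Illustrative (made-up, order-of-magnitude) numbers for the tradeoff:
# bigger memories hold more but take longer to reach.
technologies = [
    # (name,           capacity in bytes, rough latency in cycles)
    ("register file",  256,                 1),
    ("SRAM cache",     32 * 1024,           3),
    ("DRAM",           4 * 1024**3,       200),
]

for name, cap, lat in technologies:
    print(f"{name:14s} capacity={cap:>13,} B  latency ~{lat:>3} cycles")
```

Reading the rows top to bottom is exactly the walk from fast, close, small storage out to big, slow, distant storage, and caches exist precisely to paper over that gap.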