Okay. So, we'll get started; it looks like we have a quorum today. We're going to continue wrapping up what we were talking about last time, which was address translation and virtual memory. We have about two or three slides left over, and what we were mostly trying to talk about was how address translation and virtual memory influence the design of caches. Let's go back and briefly recap what we covered last time, just to put it back in everyone's memory. So, we're talking about memory management and different ways to do memory management. We had talked about some of the orthogonal concepts and functions that memory management is trying to provide. Mainly, we're trying to translate addresses so that we can move data around and remap it. We talked about protection, which allows us to run multiple different operating systems or multiple different applications at the same time on one chip. And we also started to talk about virtual memory, but this is where we left off last time: we talked about paging and briefly touched on demand paging.
So, we want to continue that today and then look at some design implications of virtual memory: the ability to have lots of different address spaces intermixed at the same time with different mappings, and how that interacts with having very large amounts of memory, or the appearance of very large amounts of memory, even though your system may have only a small amount of physical memory. So, it's a very large amount of virtual memory, as the name implies.

So, let's look at a modern-day virtual memory system. Say you're going to run Linux on your desktop these days. Your desktop has a few gigabytes of RAM, probably one, two, three, four, maybe even six if you have a really good desktop. And one of the things you may want to do is run applications that use more RAM than that. Now, we have this big storage device plugged into our computer which has lots and lots of storage, namely our disk. The disk is not the only place you can try to swap memory out to, though. You can use other things; some people have built virtual memory systems that swap memory across the network, for instance, to another computer. This was actually found to be fast in the original network-of-workstations days.
This was a project at Berkeley, where they had lots of old Sun computers on a network, and they were actually able to swap memory across the network to a neighboring computer and use the neighboring computer's RAM, effectively making the RAM of the machine they were using bigger, or effectively swapping out to it. Now, one of the key things that makes all of this work is this notion of demand paging. Demand paging is basically the idea that the operating system only maps a page when it absolutely has to. And the operating system can also decide to kick things out and put that memory, let's say, on disk. If the memory is clean (what I mean by clean is that it's not dirty relative to the process: the processor did not write any data to that memory, and the exact same blocks are on disk), then when the OS goes to kick it out, it doesn't have to save that data. It can just mark it as saying: it's on the disk already over there, so don't worry about this. If you go take an operating systems class, you're actually going to implement a demand paging scheme. I know they do that in Princeton's operating systems class.
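The clean-versus-dirty distinction just described can be condensed into a tiny eviction routine. This is only a sketch to make the idea concrete; the `Page` structure and the write-back callback are made-up names for illustration, not any real operating system's code.

```python
# Minimal sketch of demand-paging eviction (hypothetical names, not a real OS API).
# A "clean" page still matches its copy on disk, so evicting it is free;
# a "dirty" page must be written back before its physical frame can be reused.

class Page:
    def __init__(self, vpn, dirty=False, on_disk=True):
        self.vpn = vpn          # virtual page number
        self.dirty = dirty      # has the process written to this page?
        self.on_disk = on_disk  # is there a valid copy in the swap area?

def evict(page, write_to_disk):
    """Free a page's physical frame; return True if a disk write was needed."""
    if page.dirty or not page.on_disk:
        write_to_disk(page)     # must save the contents before dropping the frame
        page.on_disk = True
        page.dirty = False
        return True
    # Clean page: the identical blocks are already on disk, so just drop it.
    return False

writes = []
clean = Page(vpn=7, dirty=False, on_disk=True)
dirty = Page(vpn=9, dirty=True, on_disk=True)
print(evict(clean, writes.append))  # False: no write-back needed
print(evict(dirty, writes.append))  # True: contents saved first
```

The point of the bookkeeping is exactly what the lecture says: the OS pays the disk write only when the in-memory copy has diverged from the copy on disk.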
So, you can try out different algorithms for how to pull data in and out, because there are different management algorithms you can use to do this demand paging. But let's go back and look at how address translation works at the hardware level, because that's what we're focusing on in this course. You take an address and you run it through a translation lookaside buffer (TLB). If you have a hit, this is the everything's-going-well case: the address is in your TLB, which is a cache of the page table. You also probably want to check some protection bits, to see whether you can do a read or a write, or whether the process that is reading and writing the data is in the correct address space. For instance, you don't want an application trying to read or write the address space of the operating system, or something like that. If it's permitted, the translation comes out of the TLB, everything's good, and you just go access the data. If not, you end up in the protection fault case. Now, this protection fault case is going to be handled in the operating system, and the operating system will probably kill the process or raise some sort of segmentation fault.
Alternatively, if you're up here in the TLB lookup and you take a miss in the TLB, there are lots of interesting things that start to happen at this point. You may have a hardware page walker, which means a little finite state machine that will walk through the page table, all the way to the end, find the mapping, and install it into the TLB. This is all done in hardware. If you are on an architecture like MIPS, or Alpha, or Tilera, that's all done in software. So this page walker, and updating the TLB, is actually done in software: you'll have some software walking the table and doing a bunch of memory references in that little piece of software. And then there are some hybrid approaches. For instance, in MIPS there's some special hardware that helps the software walk the page table faster. If the memory you're trying to get at is not in the page table at all, then you follow the operating system line here: you trap into the operating system. The operating system then has to look through all its structures and see: is that data on the disk somewhere? Or are you accessing some piece of memory that doesn't actually exist?
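What a page walker does, whether implemented as a hardware state machine or as a software refill handler, can be sketched with a simple two-level table. The field widths and dictionary-based tables below are illustrative assumptions, not any particular ISA's page-table format.

```python
# Sketch of a page walk over a hypothetical two-level page table.
# The walk itself is the point: index with the high bits, follow the
# pointer to the second-level table, index with the next bits.

PAGE_BITS = 12          # 4 KiB pages
LEVEL_BITS = 10         # 10 index bits per level (assumed for this sketch)

def walk(root, vaddr):
    """Return the physical address for vaddr, or None on a page fault."""
    l1 = (vaddr >> (PAGE_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    l2 = (vaddr >> PAGE_BITS) & ((1 << LEVEL_BITS) - 1)
    second = root.get(l1)            # first memory reference of the walk
    if second is None:
        return None                  # no second-level table: fault to the OS
    frame = second.get(l2)           # second memory reference
    if frame is None:
        return None                  # page not mapped: fault to the OS
    return (frame << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))

# Map the virtual page with l1 index 1, l2 index 0 to physical frame 0x80.
table = {1: {0: 0x80}}
print(hex(walk(table, 0x0040_0123)))  # 0x80123
print(walk(table, 0xdead_f000))       # None -> the handler would refill or fault
```

Note that even this toy walk costs extra memory references per miss, which is exactly why a software refill on MIPS-style machines is slower than a dedicated hardware walker, and why the hybrid helpers mentioned above exist.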
If that's the case, then you're going to end up in this segmentation fault or bus error world as well, because you're basically trying to do a memory reference to some piece of memory which isn't there. But if, let's say, it's on disk, in swap or in the backing page cache on the disk, you're basically going to have the OS fill everything in, fill in the TLB, and just return. And life is all good, and you continue on.

So, now that we've decided that we want something like virtual memory, and we've decided that we want translation lookaside buffers, let's look at how this influences the design of our hardware pipelines. Here we have address translation shoehorned into a 5-stage pipeline. As some of you may notice, here and here, we just added to the delay of these stages. We just sort of shoehorned it in. We didn't add an extra stage; we just put it in there, and it's serial. That's the naive approach to doing this. Hmm, well, that has some serious latency considerations. If your instruction memory is on the critical path of your processor and, all of a sudden, you put something else in series with it, your processor gets slower.
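The whole decision diagram we just walked through — TLB hit with a permission check, TLB miss with a refill, and the two fault cases — can be condensed into one small sketch. The structures here are hypothetical stand-ins for the hardware, chosen only to show the control flow.

```python
# Condensed sketch of the translation flow: TLB hit plus permission check,
# TLB miss refilled from the page table, protection fault, and page fault.
# All structures are illustrative stand-ins, not a real MMU.

def translate(vaddr, is_write, tlb, page_table):
    vpn, offset = vaddr >> 12, vaddr & 0xFFF
    entry = tlb.get(vpn)
    if entry is None:                       # TLB miss: walk the page table
        entry = page_table.get(vpn)
        if entry is None:
            return ("page_fault", vpn)      # OS: fetch from disk, or kill
        tlb[vpn] = entry                    # refill the TLB, then proceed
    frame, writable = entry
    if is_write and not writable:
        return ("protection_fault", vpn)    # OS: likely a segmentation fault
    return ("ok", (frame << 12) | offset)

tlb = {}
page_table = {0x10: (0x2A, False)}          # one mapped, read-only page
print(translate(0x10_444, False, tlb, page_table))  # ('ok', 0x2A444)
print(translate(0x10_444, True, tlb, page_table))   # ('protection_fault', 16)
print(translate(0x99_000, False, tlb, page_table))  # ('page_fault', 153)
```

The hardware-versus-software question from the previous slide is just about who executes the refill branch in the middle; the surrounding flow is the same either way.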
Or if your data cache is on the critical path of your processor and you add something in there, that's also going to slow down your processor. So, we want to look at techniques where we can move those two structures off the critical path. Alternatively, we can start to think about pipelining the TLB and the cache, so you have one stage for the TLB lookup and one stage for the cache. It gets a little hard over here on the data side of the world, because that's going to increase the access time to your data memory. So, when you do a load, this is going to push that load out an extra cycle if you put another pipe stage in there. And on the instruction side, this could also hurt your branches. If you put a pipeline stage somewhere like here, then your PC-plus-four loop gets a little bit harder. From a timing perspective it probably doesn't get harder, but from a branch perspective, if you take a branch in there, you're going to add an extra cycle to your branch mispredict latency. And it also gets a little bit weird here, in that you can no longer access instruction memory effectively in one cycle.
Having said that, it's usually easier to take the instruction side off the critical path, because you don't really change the high-order bits of your instruction address that often. You only cross page boundaries when you do a jump, or if you happen to fall through the end of the page, and both of those are relatively rare cases. So, people have found it pretty easy to take that off the critical translation path. So, we're also going to focus on the data side of the world here.
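One way to see why the instruction side is the easy case: fetch can simply remember the last page's translation and reuse it until a page boundary is crossed. The sketch below is a toy model of that idea under the assumptions stated in the comments, not any particular machine's fetch logic.

```python
# Toy model of why instruction-side translation stays off the critical path:
# sequential fetch stays on one page, so a single remembered translation
# covers it until a jump or a fall-through crosses a page boundary.

PAGE = 1 << 12      # assume 4 KiB pages

class FetchTranslator:
    def __init__(self, translate):
        self.translate = translate      # the slow, full TLB/page-table path
        self.last_vpn = None
        self.last_frame = None
        self.slow_lookups = 0

    def fetch_paddr(self, pc):
        vpn = pc // PAGE
        if vpn != self.last_vpn:        # only on a page crossing (rare)
            self.slow_lookups += 1
            self.last_vpn = vpn
            self.last_frame = self.translate(vpn)
        return self.last_frame * PAGE + pc % PAGE

ft = FetchTranslator(lambda vpn: vpn + 100)   # dummy page-to-frame mapping
for pc in range(0x1000, 0x3000, 4):           # 2048 sequential fetches
    ft.fetch_paddr(pc)
print(ft.slow_lookups)  # 2 -- one slow lookup per page, not per instruction
```

The high-order bits change only twice across those two thousand fetches, which matches the lecture's point: the full translation almost never sits on the fetch critical path.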