Okay, so I want to briefly give a case study here of one of the more interesting modern-day VLIW architectures, probably the most famous, and possibly also the most infamous, VLIW processor out there. This is the Intel Itanium, also known as IA-64, or what's known as an EPIC processor: Explicitly Parallel Instruction Computing architecture. A lot of this work was actually done in collaboration between Intel and HP. HP uses these a lot in their big servers, their sort of big, not quite mainframes, but big, heavy, big-iron computers. And Intel was trying to use this to effectively kill all of the other workstation vendors. This was going to be their 64-bit solution to computing. So it's a modern, non-classical VLIW, and this was going to be Intel's chosen ISA. They were going to deprecate x86 and choose IA-64 as the 64-bit ISA. And as we now know, going a few years forward after the creation of all this stuff, that didn't really happen. Intel went and did this; it built a bunch of processors with this instruction set. You can still buy processors with this instruction set, but it never got as good of an acceptance as the competitor. The competitor at the time was called AMD64, which is a 64-bit extension to what people already had.
And that's what people ended up wanting: just a 64-bit extension to what we already had, versus, you know, something totally different. Okay, so a couple of features here. It's an object-code-compatible VLIW, so it's not quite a VLIW in the classical sense. Object code compatible means that different generations, different microarchitectures of this VLIW, can run the same instruction code, the same binaries, without needing to recompile. And how they did this, effectively, as I alluded to before, is they had the ability to have parallelism straddle across instruction bundles. And they had this notion of groups, which we'll talk about in a second. So, the first implementation of this, Merced, was the first Intel Itanium implementation. It was kind of like the 8086 of IA-64. And Merced, as you'll realize if you look at Intel code names, is named after a river. Intel likes to name their things after either rivers or places. I think this has something to do with the fact that you can't trademark a place name, so they get around that and make sure they don't have any trademark issues by choosing place names for all their code names. One of the big problems here: it was supposed to ship in 1997. First customer shipment was not until 2001. That's a four-year miss.
And superscalar was another thing that had sort of caught up with it by that time. It was supposed to be faster and better than everything else, and the first one was not very good. It had low clock rates and was not as high performance as it was supposed to be. And the x86 side of Intel's business line actually had almost the same performance as the first Itanium, and then very quickly surpassed it. So their high-end processor wasn't actually high end. A couple of other things here: McKinley was the second implementation, shipped pretty quickly after that. This was a much better implementation, but, you know, it's still hard to do. But they're still building these things. So, in 2011 at ISSCC, Intel introduced the Poulson processor. A big machine here: eight cores in 32 nanometer, lots and lots of cache. We'll look at that; 32 megabytes of shared L3 cache, a big processor, 544 square millimeters in 32 nanometer. At the time this came out, this was the biggest processor ever built, with the most transistors, over three billion transistors, or at least the biggest commercial one. Intel might have had a research prototype, I think, that might have had more transistors than this.
I think their multicore processor, what they call the SCC, their Single-chip Cloud Computer, might have had more, but I should know the transistor count. But from a commercial processor perspective, it's a huge chip. But they are selling into extremely expensive sockets; these sell at a premium and go into big mainframes. That was not what this was originally destined for. It was destined for both big mainframes and workstations. But now, standing here in 2012, this is not used in lots of other places except for sort of bigger hardware, mainframe sorts of things. So a few of the interesting things here: the cores are multi-threaded, and you can fetch six instructions per cycle and execute up to twelve instructions per cycle, per core, and there are eight cores. So this is a beast of a machine, a very high performance computer. Okay, so let's dive into some of the details here of Itanium. Itanium has a 128-bit instruction bundle, and inside of there you can fit three operations, and then there are some bits called template bits, which sort of say what is in the instruction bundle. So it's not actually a fixed-format bundle; the instruction boundaries can move around a little bit.
And they did that so you can sort of mix in something like an instruction with an immediate alongside an instruction which doesn't have an immediate, and get more space in the bundle for the immediate bits, or a branch offset or something like that. These template bits also describe how a particular bundle relates to the other bundles around it. So sometimes these are called begin and end bits, or start and stop bits. They say the number of instructions which can execute explicitly in parallel. And the machine doesn't necessarily have to execute these in parallel. So for instance, if you say twenty instructions, or twenty operations, can execute in parallel, but your machine is only two wide, because they built a two-wide implementation of Itanium or IA-64, you're just going to execute two wide for ten cycles, or something like that. But what's really cool here is the compiler is able, just like in all the other VLIWs, to express the parallelism to the machine explicitly. Some interesting things about the registers. Because this is a VLIW processor, and because you're going to have to do code scheduling like what we saw in the last class, that increases the general-purpose register pressure. You don't have a register renamer, so you can't go and use different names for things.
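The stop-bit idea above can be sketched in a few lines. This is a toy model, not real Itanium semantics: the compiler marks boundaries between groups of independent operations, and an implementation of any issue width just executes each group over as many cycles as its width requires, without re-analyzing dependences.

```python
def cycles_to_issue(groups, issue_width):
    """groups: list of group sizes, i.e. counts of independent ops
    between stop bits. Returns total cycles on a machine of the
    given issue width."""
    total = 0
    for ops_in_group in groups:
        # Ops within a group are declared independent, so they can
        # issue together, but only issue_width at a time.
        total += -(-ops_in_group // issue_width)  # ceiling division
    return total

# A group of 20 independent ops on a 2-wide machine takes 10 cycles,
# just as described above; a 6-wide machine needs only 4.
print(cycles_to_issue([20], 2))   # 10
print(cycles_to_issue([20], 6))   # 4
```

The point is that the same binary carries the same parallelism annotations to narrow and wide implementations alike, which is what makes the object-code compatibility work.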
And the hardware's not going to rename things for you. So instead, the compiler and the software are going to have to do the renaming. So they had 128 general-purpose registers and another 128 floating-point registers. And they also have these predicate registers. It's not quite full predication, but it's pretty close to full predication. So you can have bits that say whether later instructions are going to execute or not, and you have to compute that into a little register file. So they had a predicate register file that you have to bypass. That's sort of interesting to see. And then they had a really interesting feature here, which is called a rotating register file. Let's talk about what a rotating register file is. The problem this is trying to solve is in a code sequence like we saw in the last lecture. If you have a very-long-instruction-word scheduled piece of code, and you want to get good performance, you're going to have to unroll the loop, and then you're going to have to software pipeline the loop. But when you do this, it's going to increase your register pressure, or increase how many register names you need to use.
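The predicate-register idea can be sketched as a toy interpreter. This is illustrative only, not the real IA-64 encoding: a compare writes a pair of predicate bits, and each later operation carries a guard predicate; if the guard is false, the operation is squashed into a no-op.

```python
# Toy model of (almost) full predication, assuming a small predicate
# register file where p0 would be hardwired true on the real machine.
pred = [True] * 64
regs = {"r1": 5, "r2": 9, "r3": 0}

def cmp_lt(p_true, p_false, a, b):
    # Roughly: cmp.lt p1, p2 = r1, r2 sets p1 = (a < b), p2 = !(a < b)
    pred[p_true] = regs[a] < regs[b]
    pred[p_false] = not pred[p_true]

def add(guard, dst, a, b):
    # Roughly: (p) add dst = a, b only takes effect when the guard is set
    if pred[guard]:
        regs[dst] = regs[a] + regs[b]

cmp_lt(1, 2, "r1", "r2")    # r1 < r2, so p1 = True, p2 = False
add(1, "r3", "r1", "r2")    # executes: r3 = 5 + 9
add(2, "r3", "r2", "r2")    # squashed: guard p2 is false
print(regs["r3"])           # 14
```

In hardware the interesting part is exactly what the lecture mentions: those predicate bits are produced late and consumed immediately, so the predicate register file needs its own bypass network.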
And, as we saw, you're going to have to add extra special code in the prologue and the epilogue, which are different from the main loop body. So how do you solve this in one fell swoop? Well, you add a subset of your register space which will sort of statically rename itself every loop iteration. So each iteration slightly changes the naming of the registers. And what this looks like is, if you go to access, let's say, register R1, there's an architecturally visible register called the rotating register base, or RRB here, which has a value that gets added to the register number. And it's modular arithmetic, so it wraps around at the end, and that points to different locations in the physical register file. This is pretty cool. So what we're going to do is, every single time we come to a new loop iteration, we're going to change the RRB, and it's going to point to a different set of registers. And we can effectively software pipeline just by using this one feature. So here we have the same code sequence we had from last lecture, the previous code example. And if we recall, when we unrolled all of this, what we ended up with was a load, an add, a store. We'll talk about this in a second.
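The RRB lookup itself is just modular address arithmetic over the rotating portion of the register file. A minimal sketch, with an assumed rotating-region size chosen for illustration (the real machine's region size and the direction the RRB moves differ in detail):

```python
ROTATING_SIZE = 96   # assumed size of the rotating region, for illustration

def physical_reg(arch_reg, rrb):
    # The architectural register number is offset by the rotating
    # register base, modulo the rotating region, so the same name
    # lands on a different physical slot each iteration.
    return (arch_reg + rrb) % ROTATING_SIZE

# Accessing "R1" on three successive iterations, bumping RRB each time:
for rrb in (0, 1, 2):
    print(physical_reg(1, rrb))   # 1, then 2, then 3
```

So a single architectural name like R1 silently walks through the physical file as the loop runs, which is exactly the static self-renaming described above.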
This was kind of the key thing that we were trying to execute, and if we have to unroll this, we just had to unroll the code and then look at the dependencies. So let's look at the dependencies here. Well, the dependency we're going to have is, this load writes F1, the floating-point register F1 here. And we know that this actually gets read; let's say the latency of this is one, two, three cycles, and it doesn't get read until here. Likewise, this add here computes its result, and the add, let's say, is a floating-point add, so it has some long latency, and down here is when it's read by the store. So on something like Itanium, with a rotating register file, we don't actually generate all this code. Instead, we generate one instruction, which is going to take care of our prologue, our epilogue, and the main loop. And what we're going to do is encode the distance in register numbers between these two values here. So what this means is, if this writes F1, and one, two, three loop iterations in the future something wants to read that value, we encode that here with a register number that is that number off. And then likewise here. So this would be F1 to F4, because it's off by three. And here, this writes F5.
And we know this one is to be read one, two, three, four iterations later, so we encode it with a register number that's forward into the future. And now we're going to talk about this instruction here. What this is going to do is change the rotating register base number, the RRB, and it's going to bump it by one. So we can basically just keep branching to itself here, and each time we do, all the registers are going to change names. So by the time this is ready, or by the time the load is ready here, these other values will have sort of caught up with it, where the physical register that they're actually going to look at will now point to the correct location. So we can effectively encode into one instruction here all of this, including the prologue and the epilogue, using this rotating register file. Okay, so last slide of today. Why do I think Itanium, and I think we can pretty confidently say this, failed? I actually don't think it was a lot of the ideas. I think a lot of it had to do with the implementation. So, first off, if you tie the hands of the microarchitect, they're going to scream. So, IA-64 added a lot of architectural, big-A architecture, ISA-level features in order to get speculative parallelism.
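The cross-iteration renaming above can be checked with a few lines of arithmetic. This is an illustrative model, not real IA-64 semantics; I'm assuming the RRB moves by one in a fixed direction per iteration, which is one way to model the loop-branch bumping it. The producer writes F1 in some iteration; the consumer, three iterations later, names F4 (off by three); with the RRB having moved three steps in between, both names resolve to the same physical register.

```python
SIZE = 96   # assumed rotating-region size, for illustration

def phys(arch_reg, rrb):
    return (arch_reg + rrb) % SIZE

write_iter = 0
read_iter = 3
# Model the loop branch moving RRB one step per iteration.
rrb_at_write = -write_iter % SIZE
rrb_at_read = -read_iter % SIZE

# Producer names F1; the consumer three iterations later names F4.
same_slot = phys(1, rrb_at_write) == phys(4, rrb_at_read)
print(same_slot)   # True
```

That equality is the whole trick: the compiler encodes only the iteration distance as a register-number offset, and the rotation makes the names line up, so no separately generated prologue and epilogue code is needed.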
And a lot of this stuff was implemented and talked about, but never actually built into real processors. So people didn't go through the effort, until basically the first Itanium, to try to implement some of these things, and they didn't all mix well together. And they added a lot of state, and they added a lot of complexity to the processor. So we have the ALAT, full predication or almost full predication, and rotating register files, to name a few. There's a really complex bundling sequence; it's probably one of the hardest-to-decode instruction sets in the world. Very, very challenging, and this was a big challenge, and it ties the hands of the microarchitect, and the microarchitect couldn't make a decision. So a good example of this, a funny story here, is that after the DEC Alpha employees, Digital Equipment Corporation employees, left DEC, they were sort of subsumed into a part of Intel. That same team that used to build out-of-order Alpha processors went on to go build sort of the next generation of an Itanium processor. And they went to look at the Itanium processor and said, wow, this is really complicated. It struck them as much more complicated than Alpha.
And then they said, oh, well, we could probably do better if we just built it out-of-order superscalar: took apart all of the instructions, took apart all of the dependencies, poured that data into what was effectively an Alpha out-of-order superscalar core, and then executed it. And what was funny, if you look at this, is you can sit there and just bang your head, because you did all of this work and added all of this architectural state to allow the compiler to do all this work, and then they just wanted to undo it all. They would do this because they wanted performance, but then they wanted to undo all of that state and all of the hard work the compiler did, and just redo it all dynamically, because they thought they could get better performance. They probably could have; it probably was a good idea. But what was kind of funny there is you built an instruction set with one microarchitecture in mind, basically an in-order architecture, and then, all of a sudden, people are thinking about building out-of-order variants of it. And it sort of throws everything you had before away, or all these notions sort of went away. So it's just a funny story that, you know, people tried to build out-of-order versions.
They ultimately did not end up doing that. That same team decided it was basically too hard, mostly due to the predicate registers, and sort of how to bypass predicate registers in an out-of-order machine. And I think they ultimately ended up not doing that, or they definitely ended up not doing that. And that's just what's known now as, the Wachusett, or excuse me, not the Wachusett, it's known as the Tukwila processor from Intel. Now, there were a couple of other problems here. The first implementation had a very low clock rate, so your first one out the gate was just not very good, and this hurt. And it's hard to build these things. They're wide; it's the speed demons versus the brainiacs, this question of do you want to go wide, or do you want to go long and narrow. Long and narrow was doing okay at the time. There was big code-size bloat. And it fundamentally did not solve all the dynamic scheduling problems that an out-of-order superscalar could get at. So, for instance, changing your instruction schedule based on whether a load hit or missed in the cache, it couldn't do. There was big compiler complexity; you need profiling, and not everyone wanted to profile.
There's also just not that much static instruction-level parallelism in all programs, so the compiler couldn't necessarily find all the parallelism, or it wasn't there statically, and if you're going for a compiler-only approach, you need to be able to do that. And then, this is what really killed it: people did go build these more complex out-of-order superscalars. At the time, there was this big discussion: can we build more complex out-of-order superscalars? And people said, no, those are too hard to build. They take too much, they cost too much, we don't know how to solve all these problems. So instead, we'll try to build something simpler and push a lot of complexity into the compiler. Well, there was money behind this question. So people went and did build these complex out-of-order superscalars. And that's basically what we're still using today in our desktop processors; we have out-of-order superscalars today. And then finally, the last big one here: AMD64 happened. What is AMD64? Well, it's a 64-bit extension to x86; AMD originally did this. Intel, after sort of dragging their feet for a couple of years on this, finally decided, oh, we're going to use that, because people wanted this.
People wanted code compatibility with the move to 64 bits, both wider arithmetic operations and wider addressing, so larger amounts of memory. And 64 bits is a lot of memory. So AMD originally came up with this. It's now known as, I believe, EM64T or Intel 64, not to be confused with IA-64; that's what Intel now calls these 64-bit extensions to x86, and now Intel is building those processors too. So everyone has jumped on that, and Intel has kind of de-emphasized Itanium now, the Itanium instruction set, and instead we are basically sticking with IA-32, the 32-bit x86, with the 64-bit extensions, and that's what has taken over the workstation market. And what's kind of funny here is, this processor was really designed to kill or unify all the workstation vendors together under one processor that was going to beat them all. And it did achieve its goal to some extent, because as this processor was coming around, companies either went out of business or they jumped on the IA-64 bandwagon and decided they were going to take that on. But what replaced all the different little variants of processors that were in workstations? So SPARC, PA-RISC for HP, SGI's sort of MIPS processors, did I already say SPARC?
All these sort of different things, and Power by IBM. Power is still around, but a lot of the other ones died through attrition, or moved on, or were supposed to move on, to IA-64. But IA-64 did not end up winning this. Instead we replaced them with 64-bit x86 processors. So it sort of did its job; it killed the workstation processors, but ended up replacing them not with itself but with something else. Anyway, we're going to stop here for today, and we'll talk more next