Okay. So let's talk about some of the other things that were on our list of limiters to instruction-level parallelism, and ask ourselves: are these things we can solve very easily? Well, things start to get harder pretty fast here. Like I said, dynamic events are basically things the compiler has no way of reacting to. Now, the instruction set might be able to react to them; you can add things into the instruction set, and you might take a branch dependent on a dynamic event. But let's look at some dynamic events. The first negative event here is a cache miss. Well, it's really hard for the compiler, all things being equal, to know whether a load is going to take a cache miss or not. You might have some guesses that could influence the code, but actually doing something about it is another story. If you think about what an out-of-order superscalar does: if you take a cache miss, it'll reschedule the code around the cache miss. It'll take the instructions that are not dependent on the load, pull those up, and try to execute those. But if the load hits in the cache, you want a totally different schedule: you want to start executing the instructions that are dependent on that load as soon as possible.
So an out-of-order superscalar has dynamic means to do this; it has dynamic instruction execution. But our VLIW processor, which is statically scheduled, can't really do this. So what are some techniques to go after this? One technique is something called informing loads. As far as I know, this has not actually been built in any hardware, but it's been proposed, at least in computer architecture academic circles, in the computer architecture literature. The original paper about informing loads was written in part by one of our faculty members here at Princeton, Margaret Martonosi; she's one of the authors of that paper. The basic idea is that if you have a load that misses in the cache, you do not execute the subsequent instructions: if the load misses in the cache, you basically nullify those instructions. And this allows you to change the code sequence depending on whether the load hits in the cache or misses in the cache.
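The informing-loads idea above can be sketched in a few lines. This is a minimal toy model of my own, not a real ISA: the load reports whether it hit, and on a miss the code that would normally follow the load is skipped in favor of a sequence scheduled for the long-latency case. All the names and the tiny cache model are invented for illustration.

```python
class Cache:
    """Tiny cache model: just remembers which addresses are resident."""
    def __init__(self):
        self.resident = set()

    def load(self, memory, addr):
        hit = addr in self.resident
        self.resident.add(addr)          # fill the line on a miss
        return memory[addr], hit

def informing_load(memory, cache, addr):
    # An "informing load" returns the value *and* hit/miss status,
    # letting software pick a different code sequence on a miss.
    value, hit = cache.load(memory, addr)
    if hit:
        # Hit path: the dependent instructions run immediately.
        schedule = "hit-schedule"
    else:
        # Miss path: the dependent instructions are nullified and we
        # fall into code scheduled around the miss latency instead.
        schedule = "miss-schedule"
    return value * 2, schedule           # same result either way

memory = {0x40: 7}
cache = Cache()
r1, s1 = informing_load(memory, cache, 0x40)   # cold cache: miss path
r2, s2 = informing_load(memory, cache, 0x40)   # warm cache: hit path
```

Note that both paths compute the same answer; only the schedule around the load changes, which is the whole point of the technique.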
And this was done with Professor Todd Mowry, I think when he was a graduate student, and Mark Horowitz from Stanford was on the paper; I'm trying to remember who the last author, the professor on this work, was. Another option is something the people behind the Elbrus processor did. You've probably never heard of this processor. It was built, well, I actually don't know if it was ever finished, so perhaps only prototyped, in Soviet Russia, right when the Soviet Union was breaking down. This was the design house in Soviet Russia which made all of the military processors, and after the fall of the Soviet Union that same design team went off to build commercial processors. So they went to build this processor, and it's a VLIW, a very long instruction word processor, and they had an instruction in there which tried to solve this dynamic-event problem around cache misses. What it said is: if the load misses in the cache, go execute a different piece of code, with a different schedule, than if the load had hit in the cache.
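That Elbrus-style "branch on cache miss" can be sketched as follows. This is a hedged illustration, not the actual Elbrus instruction: the compiler emits two schedules for the same region, and hardware selects between them based on the load's hit/miss outcome. Function names and the filler work are invented.

```python
def schedule_for_hit(value):
    # Hit schedule: the dependent instruction sits right after the load.
    return value + 1

def schedule_for_miss(value, independent):
    # Miss schedule: independent work is hoisted above the use of the
    # loaded value, hiding some of the miss latency.
    filler = sum(independent)            # non-dependent work runs first
    return value + 1 + 0 * filler        # same result, different order

def branch_on_miss(load_value, hit, independent):
    # The hardware "branch on miss" picks which compiler-generated
    # schedule to run for this region of code.
    if hit:
        return schedule_for_hit(load_value)
    return schedule_for_miss(load_value, independent)

hit_result = branch_on_miss(41, True, [1, 2, 3])
miss_result = branch_on_miss(41, False, [1, 2, 3])
```

Both schedules produce the same value; the miss schedule just orders the work so the processor has something useful to do while the miss is outstanding.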
So you could effectively have two different code paths here, because the compiler could actually generate two different code sequences, and you can get back a lot of the performance, get back to almost exactly what [inaudible] could do. This processor never really made it commercially, and later the company went under. I actually don't know if the assets were bought by Intel, but at least all the people who worked at this company now effectively work at Intel; they still live in Russia. So that's a funny story there: it didn't work out from a commercial perspective, but they had some cool ideas in there, like actually trying to schedule around cache misses. Okay, so some other things: branch mispredicts. Well, we already talked about one technique here; you can add predication. But that doesn't help if you have big pieces of code, big code sequences and big code hammocks; you can't necessarily predicate your entire program. So what do you do instead? Well, one solution that people have come up with for this, and this is the hard one to deal with here, is you can add branch delay slots. So you have a VLIW processor; let's say it's three wide.
And you add branch delay slots to your instruction set, and you use predication in the branch delay slots. What this allows you to do is effectively mask some of the branch mispredict penalty and change the code schedule a little bit dependent on the prediction, or even change it completely based on which way the branch goes. The way this works is that in the delay slots you use the same predicate that the branch is branching on. So if the branch is branching on, let's say, A equals B, we'll use that same condition in the predication, and effectively you can schedule the code differently depending on whether the branch is taken or not taken. And because you're putting it in the delay slots, you can get around some of the problems of whether it's taking one direction or the other, no matter which way the mispredict happens, or whether the branch is predicted correctly or not. What we're doing is basically putting code in the delay slots that will always execute, but it can be predicated, and you can actually pull code up from the two destinations of the branch. So it's sort of a way to get around branch mispredict [inaudible] here. This was actually done in a research processor that was built at MIT.
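The predicated delay slots can be modeled like this. This is my own toy sketch (not the M-Machine's actual encoding): every delay-slot instruction issues, but each carries the branch predicate or its complement, so work hoisted from either branch target only commits on its own path.

```python
def run_delay_slots(a, b, regs):
    p = (a == b)                       # the condition the branch tests
    # Three delay slots, each a (predicate, destination, operation) triple.
    slots = [
        (p,     "x", lambda r: r["a"] + 1),   # hoisted from the taken target
        (not p, "x", lambda r: r["a"] - 1),   # hoisted from the fall-through
        (True,  "y", lambda r: r["a"] * 2),   # useful on both paths
    ]
    for pred, dst, op in slots:
        result = op(regs)              # the slot always "executes"...
        if pred:
            regs[dst] = result         # ...but only commits if its predicate is true
    return regs

taken = run_delay_slots(3, 3, {"a": 10})       # branch condition true
not_taken = run_delay_slots(3, 4, {"a": 10})   # branch condition false
```

Either way the slots are filled with work, so the delay cycles are not wasted regardless of which direction the branch resolves.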
And I think it's probably also done in some of the HP processors. But the MIT processor at least was called the M-Machine, out of Bill Dally's group. He's now at Stanford, but I think the M-Machine was built right when he was moving from MIT to Stanford. They had, I think, three delay slots, and they were three wide, and they could predicate the instructions in the delay slots. The last thing is exceptions. You take an exception; you want to schedule different code. And these are basically impossible to predict; the compiler has no way to predict them. But this is hard on a superscalar also. When you have a traditional superscalar and an exception happens, they usually end up flushing the pipe anyway. And it doesn't happen that often, so it probably doesn't hurt your performance that much. So no one's going to lose too much sleep over this one. So briefly, I wanted to say something about how to build really wide VLIWs. As we go to wider and wider VLIWs, with lots of instructions executing at the same time, you have to start thinking about what the register file and the bypass network look like. So in this drawing here we actually have a figure of the C6000-series processors. These are the TI DSPs, sort of the flagship of TI's DSP line.
And this is a sort of block-level diagram of what they have. What they've actually done is divide the machine into local register files, and they bypass within what are called clusters. So clustered VLIWs are similar to clustered superscalars, which also divide the register file, but here the splitting or dividing is architectural: it's exposed at the ISA level. In something like the C6400, they actually have four instructions per cluster, so they are executing eight instructions at the same time. You can bypass values between the four ALUs within a cluster, but if you want to take a value from one cluster and move it to the other, you have a very low-bandwidth bypass path, and it takes an instruction: you effectively have to have a move instruction to move between the two different clusters. So there's lower bandwidth between the clusters and higher latency between the clusters, but inside a cluster it's very fast. And what's important to note here is that these are not two processors. These are all executing one instruction at the same time, so it's an eight-wide instruction executing on these eight different ALUs. And this is used in the sort of high-end TI DSPs; it's also used in HP and STMicro's LX processor. This is probably a processor you've never heard about, but this was actually what Josh Fisher went on to do at HP Labs after Multiflow. So this is after some of the original VLIW work; the same person went and built this LX processor, a joint collaboration between STMicro and HP. And this shows up in printers today. So the LX processor is probably not something you're going to have on your desktop machine.
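The clustered register file organization described above can be sketched with a small model. This is an illustrative abstraction in the spirit of the two-cluster C6400 description, not TI's actual design; the register names and the simple instruction count are invented. The key property it shows: ALU operations read only their own cluster's register file, and crossing clusters costs an explicit move instruction.

```python
class ClusteredVLIW:
    """Two-cluster machine: each cluster has a private register file."""
    def __init__(self):
        self.regfile = {0: {}, 1: {}}   # one local register file per cluster
        self.instructions_issued = 0

    def alu_add(self, cluster, dst, a, b):
        # Intra-cluster add: both sources must live in this cluster's
        # file, so the result can be bypassed cheaply within the cluster.
        rf = self.regfile[cluster]
        rf[dst] = rf[a] + rf[b]
        self.instructions_issued += 1

    def cross_move(self, src_cluster, src, dst_cluster, dst):
        # Inter-cluster transfer: a real instruction slot is spent,
        # modeling the low-bandwidth, higher-latency cross path.
        self.regfile[dst_cluster][dst] = self.regfile[src_cluster][src]
        self.instructions_issued += 1

m = ClusteredVLIW()
m.regfile[0].update({"A0": 5, "A1": 7})
m.regfile[1].update({"B0": 100})
m.alu_add(0, "A2", "A0", "A1")      # stays entirely inside cluster 0
m.cross_move(0, "A2", 1, "B1")      # pay one instruction to cross clusters
m.alu_add(1, "B2", "B0", "B1")      # now cluster 1 can use the value
```

The compiler's job on such a machine is to partition the code so that values rarely need to cross, since every crossing burns an issue slot that could have done useful work.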