This is a rich, rich field in the compiler research world. There have been a lot of problems with VLIWs, or classical VLIWs, and people have built things that sit somewhere between superscalars and classical VLIWs to solve some of these problems. People have come up with fancy compiler optimizations to solve some of them, and some are still open.

First on my list here: object-code compatibility. On a superscalar, because we emit one serialized instruction sequence and the architecture does all the scheduling, you can change the number of functional units under the hood in the microarchitecture of your processor and no one is ever the wiser. It will still execute the piece of code. It may not be optimal, but it will still execute. That's not necessarily the case for classical VLIWs: you have to recompile the code when you change the microarchitecture. So there's a very tight coupling between the architecture and the microarchitecture, because our instruction encoding now says exactly, let's say, two integer operations, two memory operations, and two floating-point operations, or something like that.
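As a minimal sketch of this tight coupling, here is a toy model of a classical-VLIW schedule baked to one functional-unit mix. The slot types, opcodes, and bundle shape are all invented for illustration; no real ISA is being modeled.

```python
# Hypothetical sketch: a classical-VLIW schedule is compiled against one
# fixed functional-unit mix, so it cannot be validated against a machine
# with a different mix. All names here are illustrative.

# Slot layout the compiler assumed: 2 integer, 2 memory, 2 floating-point
COMPILED_SLOT_TYPES = ("int", "int", "mem", "mem", "fp", "fp")

def runs_on(machine_slot_types, bundle):
    """A bundle executes only if each op sits in a slot of matching type."""
    if len(bundle) != len(machine_slot_types):
        return False  # the instruction width itself changed
    return all(op == "nop" or op_type == slot
               for (op, op_type), slot in zip(bundle, machine_slot_types))

# One bundle scheduled for the original mix: 2 int ops, 1 load, rest NOPs
bundle = [("add", "int"), ("sub", "int"), ("ld", "mem"),
          ("nop", None), ("nop", None), ("nop", None)]

print(runs_on(COMPILED_SLOT_TYPES, bundle))          # True: the assumed mix
print(runs_on(("int", "mem", "mem", "fp"), bundle))  # False: a different mix
```

The same binary that fit the compiled-for mix simply does not decode on a machine with a different slot layout, which is the recompilation problem in miniature.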
But if all of a sudden you build a machine which has a different mix, your schedule is completely wrong, so it's going to have bad performance, and it's probably just not going to execute, because you probably changed the instruction encoding standard when you went and made that different VLIW.

Another big problem is code size. As you can imagine, there are a fair number of no-operation instructions, or no-operation operations, inside a VLIW bundle. If you can't fill a slot on a superscalar, you just don't emit an instruction. On a VLIW, if you can't fill the slot, you have to put a NOP there, because you've got to put something there. This causes some serious problems. What hurts this even more is the fancy techniques we talked about: loop unrolling and software pipelining bloat the code size. We're replicating code; we've unrolled the code; we're using more space. This hurts our instruction cache size and instruction cache footprint.

We'll talk a little more about this in a few slides, but variable-latency operations are very hard to deal with. If you have a load, you don't know whether it's going to take a cache miss or not, so your schedule may be wrong if you guessed wrong.
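The NOP-padding cost described above can be sketched with made-up numbers: a fixed-format VLIW pays for every slot in every bundle, while a serial superscalar encoding stores only the operations that exist. Bundle width and operation size below are arbitrary assumptions.

```python
# Hypothetical sketch of VLIW NOP padding. A fixed-format VLIW must pad
# every bundle to full width; a serial encoding stores only the real ops.
# WIDTH and OP_BYTES are invented numbers for illustration.

WIDTH = 4      # operation slots per bundle
OP_BYTES = 4   # bytes per operation slot

def vliw_bytes(bundles):
    # every bundle occupies all WIDTH slots, filled or NOP-padded
    return len(bundles) * WIDTH * OP_BYTES

def superscalar_bytes(bundles):
    # a serial encoding stores only the operations actually present
    return sum(len(b) for b in bundles) * OP_BYTES

# a sparse schedule: most bundles use only 1-3 of the 4 slots
schedule = [["add", "ld"], ["mul"], ["add", "sub", "st"], ["br"]]
print(vliw_bytes(schedule), superscalar_bytes(schedule))  # 64 28
```

Even in this tiny example the padded form is more than twice the size, and unrolling or software pipelining multiplies the number of bundles on top of that.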
You can guess with some high probability: you can say, I think this load usually takes a cache miss, or, I think this load usually doesn't take a cache miss. But we can guess wrong. Similar sorts of things happen with branch mispredictions.

Scheduling for statically unpredictable branches gets very hard. There are some techniques to solve this. There's something called predication, which we'll talk about in a few slides, that helps to solve this. So you can add things back into the architecture, into your VLIW architecture, to deal with short branches that are very hard to predict, or data-dependent branches.

And as I said, depending on your design, precise interrupts can be challenging, to say the least. If you're actually using the EQ model, you probably have a hard time figuring out what to do on a single step, or if you actually take a branch in the middle while you have a pending operation going on. It's sort of undefined; it's icky. It's similar to having branch delay slots and taking a fault in your branch delay slot: what do you really do with that?

Also, and this is an interesting point here: if you have a fault on, let's say, one operation in a bundle, does the entire bundle fault, or just that operation?
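To preview what predication buys you, here is a toy if-conversion of an absolute-value branch: both arms issue, and a predicate decides which result commits. This is a sketch of the idea in plain Python, not the syntax or semantics of any real predicated ISA.

```python
# Hypothetical if-conversion sketch: a data-dependent branch is replaced
# by predicated operations that always execute but only commit their
# result when their predicate is true. A toy model, not a real ISA.

def run_predicated(x, regs):
    p = x < 0                                # set predicate instead of branching
    # both arms issue; the predicate selects which result commits
    regs["r1"] = -x if p else regs["r1"]     # (p)  r1 <- -x
    regs["r1"] = x if not p else regs["r1"]  # (!p) r1 <- x
    return regs["r1"]

print(run_predicated(-5, {"r1": 0}))   # 5 -- abs() with no branch to predict
print(run_predicated(3,  {"r1": 0}))   # 3
```

The hard-to-predict branch disappears entirely, which is exactly why predication helps a static scheduler: the schedule no longer depends on which way the branch goes.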
Or take an interrupt. Typically, the way people implement this is to make the entire bundle, or the entire instruction, atomic. If anything in the bundle takes a fault, you don't allow any of the sub-operations to commit. That sounds like the most rational thing to do. People have done things in the middle; I probably don't recommend building any of those machines. In the VLIWs that I've built, an entire bundle is atomic. That makes traps a lot easier.

But people have not always done that. One of the interesting cases, if you think about it: if you have, let's say, five operations in a bundle and only one of them faults, maybe you handle that one but let the other ones commit, and then when you come back you use some sort of mask to say which ones you need to re-execute. People have built things like that; they get tricky.

Okay. For the rest of today's lecture and next lecture, we're going to talk about techniques to solve a lot of these problems, or a lot of these challenges. Some of them are compiler techniques, some of them are hardware, and some of them are both.
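The bundle-atomic policy described above can be sketched as an all-or-nothing commit: results are staged, and if any operation faults, none of them reach architectural state. The register names and the fault mechanism below are invented for illustration.

```python
# Hypothetical sketch of bundle-atomic trap handling: if any operation in
# a bundle faults, no results commit, so the trap handler sees a clean
# pre-bundle architectural state. All names are illustrative.

class Fault(Exception):
    pass

def execute_bundle(regs, ops):
    staged = {}                      # results held back until all ops succeed
    for dest, fn in ops:
        try:
            staged[dest] = fn(regs)
        except Exception:
            raise Fault(dest)        # staged results are discarded: nothing commits
    regs.update(staged)              # all ops succeeded: commit the whole bundle

regs = {"r1": 10, "r2": 0}
try:
    execute_bundle(regs, [
        ("r3", lambda r: r["r1"] + 1),        # would succeed on its own
        ("r4", lambda r: r["r1"] // r["r2"]),  # divide by zero: faults
    ])
except Fault:
    pass
print(regs)   # {'r1': 10, 'r2': 0} -- state unchanged, the bundle was atomic
```

The partial-commit-plus-mask scheme mentioned above would instead commit `r3` and record that only the faulting slot needs re-execution, which is where the trickiness comes from.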
The first thing people try to do is come up with compressed instruction encodings, or fancier instruction encodings, but when you go to do this, it makes the front end more complicated. So here we have, let's say, some instruction, but inside of the instruction we can have different groups: inside of a group, operations execute in parallel, but between groups they do not. Something like the Itanium processor, or the IA-64 processor, from Intel, actually looks something like that.

Another thing you can do is have compressed instruction formats, and then when you go to actually execute, you decompress the NOPs into your instruction memory, maybe. That's what the Multiflow TRACE processor did. Marking parallel groups is what I was talking about before.

Cydrome had an interesting solution to this. They actually had a single-operation view of a VLIW instruction. To save space, they had their wide instructions, but if you had a case where you were only going to execute one operation in an instruction, there was a special encoding format just for that case. That saved a lot of encoding space, or a lot of instruction space, if you will.

Another example of this is a processor I worked on, the Tilera TILE64 processor, which is a 3-wide VLIW.
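The parallel-group idea can be sketched with a stop bit, loosely in the spirit of how IA-64 delimits instruction groups: operations are stored serially with a bit that closes each parallel group, so no NOPs need to be stored in memory. The stream format below is invented, not the real IA-64 bundle encoding.

```python
# Hypothetical stop-bit sketch: instructions are stored serially, and a
# stop bit marks the end of each parallel group, so empty slots never
# appear in memory. Loosely inspired by IA-64 groups; format is invented.

def split_groups(stream):
    """stream: list of (op, stop_bit); returns lists of parallel ops."""
    groups, current = [], []
    for op, stop in stream:
        current.append(op)
        if stop:                 # stop bit closes the current parallel group
            groups.append(current)
            current = []
    if current:                  # trailing group without a stop bit
        groups.append(current)
    return groups

stream = [("add", 0), ("ld", 1),       # group 1: add || ld
          ("mul", 1),                  # group 2: mul alone, no NOPs stored
          ("sub", 0), ("st", 0), ("br", 1)]
print(split_groups(stream))
# [['add', 'ld'], ['mul'], ['sub', 'st', 'br']]
```

The front-end cost the lecture mentions shows up here: the decoder must now scan for stop bits to reassemble groups, instead of reading fixed-width bundles.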
We had an encoding standard, or we have an encoding standard, which allows you to execute either two operations at a time or three operations at a time. When you're executing two operations at a time, you have a richer palette of instructions you can execute. So it's something in the middle: it gives you some better code density benefits, but without the complexity of having fully compressed formats and things like that.

Okay, so that's one way to deal with the instruction encoding challenges. One thing you can think about, though, is just having a bigger instruction cache, and a wider bus from your instruction cache onto your memory system. That does solve a lot of these problems. It costs hardware, but it's a simple, stupid solution to the problem versus a smart solution. These sorts of things on the list here are complex, smart solutions; the simple solution is just to have a bigger instruction cache. And if you have bigger code sequences, you won't feel the performance hit as much from that.
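The dual-format idea can be sketched as a choice between two bundle shapes: three operations from a restricted set, or two operations from the full palette. The opcode lists and the format-selection rule below are invented for illustration and are not the actual TILE64 encoding.

```python
# Hypothetical sketch of a dual-format bundle encoding in the spirit of a
# 2-op/3-op VLIW: a bundle holds either 3 ops from a restricted opcode
# set or 2 ops from a richer set. Opcode lists are invented.

RESTRICTED = {"add", "sub", "ld", "st"}   # ops available in the 3-op format

def pick_format(ops):
    if len(ops) == 3 and all(o in RESTRICTED for o in ops):
        return "3-op"
    if len(ops) <= 2:
        return "2-op"                     # full opcode palette available
    raise ValueError("bundle must be split and re-scheduled")

print(pick_format(["add", "ld", "st"]))   # 3-op: dense common case
print(pick_format(["mulhi", "add"]))      # 2-op: rich op needs the wide format
```

The density win comes from the common case fitting the narrow 3-op format, while the rare, richer operations fall back to the 2-op format instead of forcing every bundle to carry full-width encoding space.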