Okay. So now we get into some more fun pictures here. Let's take a look at a processor based on the in-order processors we've been looking at up to this point. We fetch the code, we decode, we've added or renamed a stage here, which we'll talk about in a second, and then we have different pipelines and one writeback. We have two register files. We have a scalar register file, which is what we're calling our old register file, and we have the vector register file, which has lots and lots of data in it.

The vector length register sort of sits out in front here in our register fetch stage. And we make a special stage for this because it does a lot of work. What is it going to do? Well, when an operation gets to this stage, it's going to start reading the register file. And if you have vector registers, it's going to sit there and read the first element, the second element, the third element, in time, and start shoving them down one of the pipes. So let's say this is our multiply pipe here, and it's four stages long, and we do a multiply with a vector length of 64. It's going to do 64 sequential reads out of here, and then send 64 operations down this multiply pipe. Note, we are not looking at any parallelism yet in this example. We're doing everything sequentially here.
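The sequential issue the lecture describes can be sketched as a tiny timing model. This is a hypothetical sketch, not a real simulator: the names `VL` and `MUL_STAGES` and the pipe depth of four are taken from the example in the lecture, and the cycle count ignores fetch and decode.

```python
# Hypothetical sketch: the register-fetch (R) stage reads one vector
# element per cycle and pushes one multiply operation down the pipe.
VL = 64          # vector length register
MUL_STAGES = 4   # depth of the multiply pipe (from the lecture's example)

def sequential_vector_multiply(a, b):
    """Issue VL element-wise multiplies, one per cycle, no parallelism."""
    result = []
    for i in range(VL):              # 64 sequential register-file reads
        result.append(a[i] * b[i])   # one operation enters the multiply pipe
    # cycles to issue all elements plus drain the pipe
    cycles = VL + MUL_STAGES - 1
    return result, cycles

res, cycles = sequential_vector_multiply(list(range(64)), [2] * 64)
print(cycles)  # 67: 64 issue cycles plus 3 cycles to drain
```

The point of the model is that one vector instruction occupies the multiply pipe's issue port for the full vector length, which is exactly the "everything sequential" behavior above.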
One thing you will note is, instructions can stop here, and they basically generate more work at that point, up to the maximum vector length, or whatever the vector length register is currently set at. So let's look at a basic operation here. We're going to have a piece of code which is doing the same operation: we're taking vector A and vector B and element-wise multiplying them, and i is four. I use a small i here because it's hard to draw these things with an i of 64. [LAUGH] Just lots of instructions to draw. If we go look at the assembly code, it's the same assembly code we had before, but we load the vector length with four instead of 64. So: load vector, load vector, multiply vector-vector double, and store. Let's look at the second load, the multiply, and the store; I don't have the other load, or the load immediate, on here because it just took up too much space. This load vector here is going to start off by fetching and decoding, and then it's usually going to sit at the R stage for a while, inserting loads, loads, loads, loads down the pipeline. Okay. In this basic vector execution, we don't have any bypassing, and we do register dependency checking through the register file and on whole registers. So, we stall the instruction if the whole vector is not ready.
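The whole-register dependency rule can be captured with a hedged cost model: because there is no bypassing, each vector instruction must completely write back before the next one begins its register fetches, so the instructions serialize. The opcodes mirror the lecture's sequence, but the pipe depths here are assumptions for illustration.

```python
# Hedged model: with whole-register dependency checking and no bypassing,
# each vector instruction fully drains before the next one starts.
VL = 4                                        # vector length from the example
PIPE_DEPTH = {"LV": 3, "MULVV": 4, "SV": 3}   # assumed stage counts

def serialized_cycles(program):
    total = 0
    for op in program:
        # VL issue cycles, then wait for the pipe to drain before
        # the next instruction may read the register file
        total += VL + PIPE_DEPTH[op] - 1
    return total

print(serialized_cycles(["LV", "LV", "MULVV", "SV"]))  # 25
```

Under these assumed depths, a four-element load/load/multiply/store sequence costs 25 cycles, with most of the time spent waiting on full writebacks rather than doing useful overlap.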
Now, we're going to look at ways to make that better in a second. But what that means is, we wait for all of the values of this load to write back to the register file before we go and start to do the register fetches of the next instruction. Then we do the multiplies, and multiplies have longer pipeline lengths. We wait for all of those multiplies to write to the register file before we start to go do the store operations. So, scoreboarding and bypassing here are very, very limited: you basically have a check to make sure everything is in the register file before you go ahead. I do want to introduce one piece of nomenclature, and your book calls this out: it's called a chime. A chime is how long it takes to execute one vector instruction on the architecture. What we're going to see in a little bit is that we're going to have some architectures where you can actually overlap some portion of the execution, let's say by having different functional units, and decrease the chime. So, for this architecture here, the chime is four, because it takes, basically, an occupancy of four for a vector length of four, and we only have one ALU, effectively, that can be used at a time here. So now, let's take a look at how to make things run a little bit faster.
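Chime accounting can be sketched numerically. This is a rough model in the style of the textbook the lecture refers to: instructions that can execute together form a convoy, a program taking m convoys runs in roughly m chimes, and each chime costs about VL cycles, ignoring pipeline startup. The convoy counts below are illustrative assumptions, not measurements.

```python
# Hedged chime estimate: total cycles ~= convoys * VL, ignoring startup.
VL = 4

def estimated_cycles(convoys, vl=VL):
    # each convoy occupies the machine for about one chime (vl cycles)
    return convoys * vl

no_overlap = estimated_cycles(4)   # LV, LV, MULVV, SV each run alone
overlapped = estimated_cycles(2)   # assume loads overlap with compute
print(no_overlap)  # 16
print(overlapped)  # 8
```

This matches the lecture's claim: with one effective ALU the chime per instruction is four, and overlapping work across functional units is what shrinks the total chime count.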
We've exploited no parallelism. The only real advantage we've taken here is that we've decreased our instruction fetch memory bandwidth. But that's not a real great reason to go do anything; it probably reduces power a little bit, but we want to go fast. So, we can start to think about how to overlap. If we have different functional units (the ALU, the load unit, the store unit, and the multiply unit), we can start to think about overlapping them in both space and time. So, here's an example where we're executing 32 elements as our vector length, and we actually have multiple copies of the functional units, so we can overlap different executions with each other; we'll look at some more detailed examples in a second. So, we can start to add parallelism by using multiple units at the same time: our adder unit, our multiply unit, and our load unit (this was L, this was Y, and this was X), and we can actually put multiple copies of those. We're going to call those lanes.
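The benefit of lanes can be sketched with one line of arithmetic: if each of L lanes handles every L-th element, a vector instruction occupies its functional unit for about ceil(VL / L) cycles instead of VL. A minimal sketch, with the 32-element vector length from the example:

```python
# Hedged lane model: L lanes each process every L-th vector element,
# so issue occupancy drops from VL cycles to about ceil(VL / L).
import math

def lane_cycles(vl, lanes):
    return math.ceil(vl / lanes)

print(lane_cycles(32, 1))  # 32 cycles with a single lane
print(lane_cycles(32, 4))  # 8 cycles with four lanes
```

This is the space-time overlap the lecture is building toward: multiple functional units give overlap across instructions, while multiple lanes give overlap within one vector instruction.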