Okay, so let's switch topics here. We've finished vector processors, and we're going to start talking about a technique that exploits a different form of parallelism, thread-level parallelism, than what we've been looking at up to this point. So this is not data-level parallelism. This is actual thread-level or process-level parallelism in your machine.

So what's the motivation for changing your computer architecture to run threads very effectively? We've talked about threading, or multiprogramming, in the past, where you time-slice between different processes. That's not what this is about. This is about either executing multiple programs at the same time, or time-multiplexing between processes at the granularity of a single instruction.

Where this really came from is that sometimes you can't extract instruction-level parallelism, which is what out-of-order processors and superscalars try to exploit. And sometimes you can't exploit data-level parallelism in a program either, because the data-level parallelism just doesn't work out, or you don't have the dense-matrix sorts of codes where data-level parallelism is very easy to find.

Instead, sometimes your workloads have thread-level parallelism. What we mean by thread-level parallelism is that there are independent sequential jobs that need to happen. An example would be that you have to operate on a million things, but the control flow for each of those million things is wildly different. So the most efficient way to go about doing this is to actually have a process, or a job, per work unit that you need to work on.

A good example of this is network processing. You're trying to build a firewall, and a packet comes in on your network card. Your firewall wants to inspect this packet and look for, I don't know, malicious attacks: it wants to look for something like viruses, or attack code sequences, or trojans, or something else like that.

This isn't data-level parallelism, because you can't have each of the operations working on these packets be the same. Depending on the type of packet, you're going to make wildly different control-flow decisions. For instance, whether the packet is TCP versus UDP is a big choice right at the beginning. And then, what port is it on? Is it SSH, is it HTTP, is it web traffic? You're going to do lots of different processing in your firewall based on that decision tree. So typically the most efficient way to attack this is to just have a job, or a process, or possibly a thread, per data element that comes in.
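To make that concrete, here is a minimal sketch of the thread-per-packet idea using POSIX threads. Everything in it is a hypothetical stand-in for illustration: the packet struct, the port numbers, and the inspect_* routines are not from the lecture, just placeholders for the wildly different per-packet code paths it describes.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical packet record; a real firewall would carry full headers. */
struct packet {
    int      is_tcp;  /* 1 = TCP, 0 = UDP: the first big branch */
    uint16_t port;    /* destination port drives the next branch */
    uint8_t *payload;
    size_t   len;
};

/* Placeholder inspection routines, standing in for the very different
 * code paths per packet type (virus scan, trojan signatures, ...). */
static void inspect_ssh(struct packet *p)   { (void)p; /* port 22 */ }
static void inspect_http(struct packet *p)  { (void)p; /* port 80 */ }
static void inspect_other(struct packet *p) { (void)p; }

/* One thread per packet: the control flow inside diverges per packet,
 * which is why this is thread-level, not data-level, parallelism. */
static void *inspect_packet(void *arg)
{
    struct packet *p = arg;
    if (p->is_tcp) {
        if      (p->port == 22) inspect_ssh(p);
        else if (p->port == 80) inspect_http(p);
        else                    inspect_other(p);
    } else {
        inspect_other(p);  /* UDP takes a different path entirely */
    }
    free(p);
    return NULL;
}

/* Called whenever the network card hands us a packet. */
void on_packet_arrival(struct packet *p)
{
    pthread_t t;
    if (pthread_create(&t, NULL, inspect_packet, p) == 0)
        pthread_detach(t);  /* fire-and-forget worker */
}
```

A production firewall would use a thread pool rather than one OS thread per packet, but the point here is the decision tree: no two packets necessarily execute the same instructions, so there is no single operation to vectorize.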
Another reason you might want to do this is that there are applications where having parallel threads will actually solve the problem faster. Believe it or not, the traditional traveling salesman problem is one of them. If you throw threads at it, you can actually get super-linear speedups, because the other threads are, in effect, pruning the search tree faster. The threads can share data with each other, and when another thread comes to a certain point, it doesn't have to recompute that location. It can just use the previous result of a thread that already got to that point and know that it doesn't have to go down a certain path. So this is kind of like dynamic programming, if you will, but for some sort of large search. That's one very plain example of threads getting you super-linear speedup. But you can also just use threads to go after certain types of parallelism that are hard to get at as data-level parallelism.

You can also hide latencies by using threads. As an example of hiding latency using threads, let's say you have a program that typically misses in your cache, but there is parallelism in this program. One way to attack it is to cut up the problem and have a thread per sub-problem, and when one of the threads blocks on the cache, switch to another one. So while you're waiting for memory to respond, you can do some useful work.

Okay, so let's look at this from a pipeline perspective and look at ways to recover cycles. Here we have some loads: a load, then a load that's dependent on the first load, then an add that's dependent on the second load, and finally a store which is dependent on the add. The downside here is that you get all this dead time: all this purple time in the diagram is dead time on the processor.
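Here is that dependence chain written out in C, just to make the serialization visible. The names (table, out, bias) are made up for illustration; only the load-load-add-store shape matters.

```c
/* The dependence chain from the pipeline diagram. Every statement
 * needs the result of the one before it, so the chain is inherently
 * serial no matter how the hardware issues it. */
void chain(int **table, int *out, int bias, int i)
{
    int *q = table[i];  /* load 1                           */
    int  v = *q;        /* load 2: waits on load 1's result */
    int  s = v + bias;  /* add:    waits on load 2's result */
    out[i] = s;         /* store:  waits on the add         */
}
```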
We could throw an out-of-order superscalar at this. It's not going to do any better. Well, yeah? So, you're right: if we're doing bypassing, we can pull these in one cycle earlier. But we still have a lot of dead time. And even if you had an out-of-order superscalar, out-of-order would not actually help you out in this case either, because there's an actual dependence chain through all of these instructions. Hm, now that's, that's not great.

So can we come up with some ideas to cope with this? One thing we said is that we can add bypassing to decrease the time here. But an out-of-order superscalar is not going to make this go faster, so that technique doesn't work. What other techniques have we talked about? Vector processors? Well, there are no vectors of data here. We can go wide? Well, that doesn't help; we're still only going to be executing one instruction per cycle. So we can try VLIW? That doesn't help either: if an out-of-order superscalar can't do it, VLIW probably can't do it. So we have all these dead cycles, and we want to try to recover some of these dead cycles.
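As a toy picture of what recovering those cycles could look like, here are two independent copies of the chain, hand-interleaved as contexts A and B. This is only a visualization sketch, not how the hardware does it: in a multithreaded pipeline the interleaving happens in hardware, instruction by instruction, and all the names below are made up.

```c
/* Two independent dependence chains, interleaved so that context B's
 * instructions can issue while context A's loads are outstanding. */
void chain2(int **ta, int **tb, int *oa, int *ob, int bias, int i)
{
    int *qa = ta[i];     /* A: load 1                             */
    int *qb = tb[i];     /* B: load 1 issues during A's dead time */
    int  va = *qa;       /* A: load 2                             */
    int  vb = *qb;       /* B: load 2                             */
    int  sa = va + bias; /* A: add                                */
    int  sb = vb + bias; /* B: add                                */
    oa[i] = sa;          /* A: store                              */
    ob[i] = sb;          /* B: store                              */
}
```

Neither chain finishes any sooner, but the pipeline's dead cycles now carry the other thread's useful work, which is exactly the idea the next part develops.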