Okay, so let's switch topics here. We've finished vector processors, and we're going to start talking about a technique that exploits a different form of parallelism, thread-level parallelism, than what we've been looking at up to this point. So this is not data-level parallelism. This is actual thread-level or process-level parallelism in your machine.

So what's the motivation for changing your computer architecture to run threads very effectively? We've talked about threading, or multiprogramming, in the past, where you time-slice between different processes. That's not what this is about. This is about either executing multiple programs at the same time, or time-multiplexing between processes at the granularity of a single instruction.

Where this really came from is that sometimes you can't extract instruction-level parallelism, which is what out-of-order processors and superscalars try to exploit. And sometimes you can't exploit data-level parallelism in a program either, because the data-level parallelism just doesn't work out, or you don't have the dense-matrix sorts of codes where data-level parallelism is very easy to find.

Instead, sometimes your workloads have thread-level parallelism. What we mean by thread-level parallelism is that there are independent sequential jobs that need to happen. An example would be that you have to operate on a million things, but the control flow for each of those million things is wildly different. So the most efficient way to go about doing this is to actually have a process, or a job, per work unit that you need to work on.

A good example of this is network processing. You're trying to build a firewall, and a packet comes in on your network card. Your firewall wants to inspect this packet and look for, I don't know, malicious attacks: it wants to look for something like viruses, or attack code sequences, or trojans, or something else like that.

This isn't data-level parallelism, because you can't have each of the operations working on these packets be the same. Depending on the type of packet, you're going to make wildly different control-flow decisions. For instance, whether the packet is TCP versus UDP is a big choice right at the beginning. And then, what port is it on? Is it SSH, is it HTTP, is it web traffic? You're going to do lots of different processing in your firewall based on that decision tree. So typically the most efficient way to attack this is to just have a job, or a process, or possibly a thread, per data element that comes in.
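To make that concrete, here is a minimal sketch of the thread-per-packet idea using POSIX threads. Everything in it is a hypothetical stand-in for illustration: the packet struct, the port numbers, and the inspect_* routines are not from the lecture, just placeholders for the wildly different per-packet code paths it describes.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical packet record; a real firewall would carry full headers. */
struct packet {
    int      is_tcp;  /* 1 = TCP, 0 = UDP: the first big branch */
    uint16_t port;    /* destination port drives the next branch */
    uint8_t *payload;
    size_t   len;
};

/* Placeholder inspection routines, standing in for the very different
 * code paths per packet type (virus scan, trojan signatures, ...). */
static void inspect_ssh(struct packet *p)   { (void)p; /* port 22 */ }
static void inspect_http(struct packet *p)  { (void)p; /* port 80 */ }
static void inspect_other(struct packet *p) { (void)p; }

/* One thread per packet: the control flow inside diverges per packet,
 * which is why this is thread-level, not data-level, parallelism. */
static void *inspect_packet(void *arg)
{
    struct packet *p = arg;
    if (p->is_tcp) {
        if      (p->port == 22) inspect_ssh(p);
        else if (p->port == 80) inspect_http(p);
        else                    inspect_other(p);
    } else {
        inspect_other(p);  /* UDP takes a different path entirely */
    }
    free(p);
    return NULL;
}

/* Called whenever the network card hands us a packet. */
void on_packet_arrival(struct packet *p)
{
    pthread_t t;
    if (pthread_create(&t, NULL, inspect_packet, p) == 0)
        pthread_detach(t);  /* fire-and-forget worker */
}
```

A production firewall would use a thread pool rather than one OS thread per packet, but the point here is the decision tree: no two packets necessarily execute the same instructions, so there is no single operation to vectorize.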
Another reason you might want to do this is that there are applications where having parallel threads will actually solve the problem faster. Believe it or not, the traditional traveling salesman problem is one of them. If you throw threads at it, you can actually get super-linear speedups, because the other threads are, in effect, pruning the search tree faster. The threads can share data with each other, and when another thread comes to a certain point, it doesn't have to recompute that location. It can just use the previous result of a thread that already got to that point and know that it doesn't have to go down a certain path. So this is kind of like dynamic programming, if you will, but for some sort of large search. That's one very plain example of threads getting you super-linear speedup. But you can also just use threads to go after certain types of parallelism that are hard to get at as data-level parallelism.

You can also hide latencies by using threads. As an example of hiding latency using threads, let's say you have a program that typically misses in your cache, but there is parallelism in this program. One way to attack it is to cut up the problem and have a thread per sub-problem, and when one of the threads blocks on the cache, switch to another one. So while you're waiting for memory to respond, you can do some useful work.

Okay, so let's look at this from a pipeline perspective and look at ways to recover cycles. Here we have some loads: a load, then a load that's dependent on the first load, then an add that's dependent on the second load, and finally a store which is dependent on the add. The downside here is that you get all this dead time: all this purple time in the diagram is dead time on the processor.
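Here is that dependence chain written out in C, just to make the serialization visible. The names (table, out, bias) are made up for illustration; only the load-load-add-store shape matters.

```c
/* The dependence chain from the pipeline diagram. Every statement
 * needs the result of the one before it, so the chain is inherently
 * serial no matter how the hardware issues it. */
void chain(int **table, int *out, int bias, int i)
{
    int *q = table[i];  /* load 1                           */
    int  v = *q;        /* load 2: waits on load 1's result */
    int  s = v + bias;  /* add:    waits on load 2's result */
    out[i] = s;         /* store:  waits on the add         */
}
```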
We could throw an out-of-order superscalar at this. It's not going to do any better. Well, yeah? So, you're right: if we're doing bypassing, we can pull these in one cycle earlier. But we still have a lot of dead time. And even if you had an out-of-order superscalar, out-of-order would not actually help you out in this case either, because there's an actual dependence chain through all of these instructions. Hm, now that's, that's not great.

So can we come up with some ideas to cope with this? One thing we said is that we can add bypassing to decrease the time here. But an out-of-order superscalar is not going to make this go faster, so that technique doesn't work. What other techniques have we talked about? Vector processors? Well, there are no vectors of data here. We can go wide? Well, that doesn't help; we're still only going to be executing one instruction per cycle. So we can try VLIW? That doesn't help either: if an out-of-order superscalar can't do it, VLIW probably can't do it. So we have all these dead cycles, and we want to try to recover some of these dead cycles.
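As a toy picture of what recovering those cycles could look like, here are two independent copies of the chain, hand-interleaved as contexts A and B. This is only a visualization sketch, not how the hardware does it: in a multithreaded pipeline the interleaving happens in hardware, instruction by instruction, and all the names below are made up.

```c
/* Two independent dependence chains, interleaved so that context B's
 * instructions can issue while context A's loads are outstanding. */
void chain2(int **ta, int **tb, int *oa, int *ob, int bias, int i)
{
    int *qa = ta[i];     /* A: load 1                             */
    int *qb = tb[i];     /* B: load 1 issues during A's dead time */
    int  va = *qa;       /* A: load 2                             */
    int  vb = *qb;       /* B: load 2                             */
    int  sa = va + bias; /* A: add                                */
    int  sb = vb + bias; /* B: add                                */
    oa[i] = sa;          /* A: store                              */
    ob[i] = sb;          /* B: store                              */
}
```

Neither chain finishes any sooner, but the pipeline's dead cycles now carry the other thread's useful work, which is exactly the idea the next part develops.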