Okay, so let's switch topics here. We finished vector processors, and we're going to start talking about techniques that exploit a different form of parallelism, thread-level parallelism, than we've been looking at up to this point. So this is not data-level parallelism. This is actual thread-level or process-level parallelism in your machine. So what's the motivation for changing your computer architecture to run threads very effectively? We've talked about threading, or multiprogramming, in the past, where you time-slice between different processes. That's not what this is about. This is about either executing multiple programs at the same time, or time-multiplexing between processes at the granularity of a single instruction at a time.

Where this really came from is that sometimes you can't extract instruction-level parallelism, which is what out-of-order processors and superscalars try to exploit. And sometimes you can't exploit data-level parallelism in a program, because the data-level parallelism just doesn't work out, or you don't have the dense-matrix sorts of codes where data-level parallelism is very easy to find. Instead, sometimes your workloads have thread-level parallelism. What we mean by thread-level parallelism is that there are independent sequential jobs that need to happen. An example would be that you have to operate on a million things, but the control flow for each of those million things is wildly different, so the most efficient way to go about doing this is just to have a process or a job per work unit.

A good example of this is network processing. You're trying to build a firewall, and a packet comes in on your network card. Your firewall wants to inspect this packet and look for, I don't know, malicious attacks: it wants to look for something like viruses, or attack code sequences, or trojans, or something else like that. This isn't data-level parallelism, because the operations working on these packets are not all the same. Depending on the type of packet, you're going to make wildly different control-flow decisions. For instance, is the packet TCP or UDP? That's a big choice at the beginning. And then, what port is it on? Is it SSH, is it HTTP, is it web traffic? You're going to do lots of different processing in your firewall based on that decision tree. So typically the most efficient way to attack this is to have a job, a process, or possibly a thread per data element that comes in; there's a minimal sketch of that pattern after this paragraph.

Another reason you might want to do this is that there are applications where having parallel threads actually solves the problem faster. Believe it or not, the traditional traveling salesman problem is one: if you throw threads at it, you can actually get super-linear speedups, because the threads can prune off the search tree faster together. The threads can share data with each other, and when one thread reaches a point that another thread has already computed, it doesn't have to recompute it. It can just use the previous result of the thread that already got there, and know that it doesn't have to go down a certain path.
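Going back to the firewall example for a second, here's a minimal sketch of the thread-per-packet pattern, assuming POSIX threads. The packet fields, port numbers, and check names here are purely illustrative, made up for this sketch, not taken from any real firewall.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical packet descriptor; the fields are illustrative only. */
typedef struct {
    int id;
    int is_tcp;   /* 1 = TCP, 0 = UDP */
    int port;     /* destination port */
} packet_t;

/* Each thread walks a different path through the decision tree.
 * That control-flow divergence is why this is thread-level, not
 * data-level, parallelism. */
static void *inspect(void *arg)
{
    packet_t *p = (packet_t *)arg;

    if (p->is_tcp) {
        if (p->port == 22)
            printf("packet %d: TCP/SSH checks\n", p->id);
        else if (p->port == 80)
            printf("packet %d: TCP/HTTP checks\n", p->id);
        else
            printf("packet %d: generic TCP checks\n", p->id);
    } else {
        printf("packet %d: UDP checks\n", p->id);
    }
    return NULL;
}

int main(void)
{
    packet_t pkts[] = { {0, 1, 22}, {1, 1, 80}, {2, 0, 53} };
    pthread_t tid[3];

    /* One thread per packet: each work unit gets its own control flow. */
    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, inspect, &pkts[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```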
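And here's a rough sketch of the traveling-salesman point: a tiny branch-and-bound search where the threads share a best-cost-so-far, so a bound found by one thread prunes the subtrees of all the others. The distance matrix and the one-thread-per-first-hop split are made-up illustration, not the lecture's example.

```c
#include <pthread.h>
#include <stdio.h>
#include <limits.h>

#define N 5

static const int dist[N][N] = {
    { 0, 3, 4, 2, 7 },
    { 3, 0, 4, 6, 3 },
    { 4, 4, 0, 5, 8 },
    { 2, 6, 5, 0, 6 },
    { 7, 3, 8, 6, 0 },
};

/* Shared best tour cost: once any thread finds a good tour, every
 * other thread can skip subtrees that already exceed it -- this
 * sharing is where the super-linear effect comes from. */
static int best = INT_MAX;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void search(int city, unsigned visited, int cost, int count)
{
    pthread_mutex_lock(&lock);
    int bound = best;
    pthread_mutex_unlock(&lock);
    if (cost >= bound)           /* prune: another thread did better */
        return;

    if (count == N) {            /* full tour: close the loop to city 0 */
        int total = cost + dist[city][0];
        pthread_mutex_lock(&lock);
        if (total < best)
            best = total;
        pthread_mutex_unlock(&lock);
        return;
    }
    for (int next = 1; next < N; next++)
        if (!(visited & (1u << next)))
            search(next, visited | (1u << next),
                   cost + dist[city][next], count + 1);
}

/* One thread per first hop out of city 0. */
static void *worker(void *arg)
{
    int first = *(int *)arg;
    search(first, (1u << 0) | (1u << first), dist[0][first], 2);
    return NULL;
}

int main(void)
{
    pthread_t tid[N - 1];
    int firsts[N - 1];
    for (int i = 1; i < N; i++) {
        firsts[i - 1] = i;
        pthread_create(&tid[i - 1], NULL, worker, &firsts[i - 1]);
    }
    for (int i = 0; i < N - 1; i++)
        pthread_join(tid[i], NULL);
    printf("best tour cost: %d\n", best);
    return 0;
}
```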
So this is kind of like dynamic programming, if you will, but for some sort of large search. That's one example, a very plain example, of having threads get you super-linear speedup. But you can also just use threads to go after certain types of parallelism that are hard to get at as data-level parallelism. You can also hide latencies by using threads. An example of hiding latency using threads: let's say you have a program that you know typically misses in your cache, but you have parallelism in this program. One way to attack it is to cut up the problem and have a thread per sub-problem, and when one of the threads blocks on the cache, switch to another thread. So while you're waiting for memory to respond, you can do some useful work.

Okay, so let's look at this from a pipeline perspective and look at ways to recover cycles. Here we have some loads: a load, then a load that's dependent on the first load, then an add that's dependent on the second load, and finally a store that's dependent on the add. The downside is you get all this dead time; all this purple time here is dead time on the processor. We could throw an out-of-order superscalar at this, and it's not going to do any better. Well, yeah? You're right, if we're doing bypassing, we can pull each of these one cycle earlier. But we still have a lot of dead time. And an out-of-order superscalar would not actually help you in this case either, because there's an actual dependency chain through all these instructions; the chain is written out in the sketch after this paragraph. Hmm, that's not great. So can we come up with some ideas to cope with this? One thing we said is we can add bypassing to decrease the time here. But an out-of-order superscalar is not going to make this go faster, so that technique doesn't work. What other techniques have we talked about? Vector processors? Well, there are no vectors of data here. We can go wide? That doesn't help; we're still only going to execute one instruction per cycle. We can try VLIW? That doesn't help either; if an out-of-order superscalar can't do it, VLIW probably can't do it. So we have all these dead cycles, and we want to try to recover some of them.
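To make that dependency chain concrete, here it is written out in C. This is my own rendering of the load-load-add-store sequence from the slide, not the lecture's actual code; the function and variable names are made up.

```c
/* A serial dependency chain like the one on the slide: each statement
 * needs the result of the previous one, so there is nothing independent
 * for an out-of-order superscalar to issue into the dead cycles. */
long chase(long **table, long k, long *out)
{
    long *p = table[0];  /* load 1 */
    long  v = p[0];      /* load 2: waits on load 1's result */
    long  s = v + k;     /* add:    waits on load 2's result */
    out[0] = s;          /* store:  waits on the add */
    return s;
}
```

The idea we're heading toward is that if the hardware also holds a second thread running an independent chain like this on different data, that thread's instructions have no dependences on the first thread's, so they can be issued into exactly those dead cycles.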