Welcome to part one of our probability review. The first time we need these concepts in this course is for those who want to understand the analysis of QuickSort: why it runs in O(n log n) time on average. These topics will also come up a couple of other times in the course, for example when we study a randomized algorithm for the minimum cut problem in graphs, and when we try to understand the performance of hashing.

Here are the topics we're going to cover. We'll start at the beginning with sample spaces, and then we'll discuss events and their probabilities. We'll talk about random variables, which are real-valued functions on a sample space, and about expectation, which is basically the average value of a random variable. We'll identify and prove a very important property called linearity of expectation, which will come up over and over again in our analyses of randomized processes. Those are the topics for part one, and we'll conclude the video with one example tying these concepts together: load balancing.

This video is by no means the only source you can turn to to learn about these concepts. A couple of other sources I recommend are the online lecture notes by Eric [inaudible] and Tom [inaudible], and there is also a wikibook on discrete probability you could check out. And I want to be clear: this is really not meant to be a course or a tutorial on probability concepts. It is only meant to be a refresher, so I'm going to go at a reasonably fast pace and give a pretty cursory presentation. If you want a more thorough review, I'd check out one of these other sources, or your favorite book on discrete probability. Along the same lines, many of you have seen some of this material before. Don't feel compelled to watch this video straight through from beginning to end; feel free to just dip in and review the concepts you need a refresher on.

So, let's start at the beginning with sample spaces. What is a sample space? Well, we're analyzing random processes, so any number of things can happen, and the sample space is just the collection of all of the things that could happen. This is the universe in which we're going to discuss probabilities and average values. I'll use the notation capital omega (Ω) for the sample space. One thing we have going for us in the design and analysis of algorithms is that we can typically take Ω to be a finite set, so we are dealing only with discrete probability, which is a very happy thing, because it's much more elementary than general probability.

Now, I realize this is a pretty abstract concept, and the next few definitions will also be abstract, so I'm going to use two very simple, very concrete running examples to illustrate them. The first example: we take two six-sided dice and roll them. What's the sample space? It's just all of the possible outcomes of the two dice, 36 of them. In addition to defining the outcomes, everything that could possibly happen, we also need to define the probability of each individual outcome. Of course, each probability should be non-negative, and there's the obvious constraint that the probabilities should sum to one: exactly one thing is going to happen. In the dice example, assuming each die is well crafted, we expect each of the 36 outcomes to be equally likely, each occurring with probability 1/36.
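Here's a tiny sketch of these definitions in code (my own, not part of the lecture): the sample space for the two-dice example, with each outcome given the uniform probability 1/36, and the two constraints on the probabilities checked explicitly. The names omega and p are just illustrative.

```python
from fractions import Fraction

# Sample space for rolling two six-sided dice: all 36 ordered outcomes.
omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

# Probability of each individual outcome (uniform, since the dice are fair).
p = {outcome: Fraction(1, 36) for outcome in omega}

assert all(prob >= 0 for prob in p.values())  # probabilities are non-negative
assert sum(p.values()) == 1                   # probabilities sum to exactly one
print(len(omega))                             # 36
```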
The second running example I'm going to use is more directly related to algorithms, and it's motivated by the QuickSort algorithm. Recall that we're studying the implementation of QuickSort that chooses a pivot uniformly at random in every recursive call. Let's focus on the very first, outermost call of QuickSort and think about the random choice of the pivot just in that call. The sample space, all of the different things that could happen, is just the n different choices of pivot, assuming the array has length n. So we can represent the sample space as the integers 1, 2, up to n, corresponding to the array index of the randomly chosen pivot. And by the construction of our code, each of these outcomes is equally likely, with probability 1/n.

Now let's talk about events. An event is nothing more than a subset of the things that could happen; that is, a subset of the sample space Ω. The probability of an event is exactly what you'd think it would be: the sum of the probabilities of all of the outcomes contained in that event. An event is just a bunch of things that might happen; we know the probability of each individual outcome, and we add them up to get the probability of the event.

The next two quizzes are meant to give you some practice with these concepts, and in particular with computing the probabilities of events in our two running examples. The first quiz concerns our first running example, where we roll two dice and have 36 possible outcomes. Consider the subset of outcomes in which the sum of the two dice equals seven. What is the probability of that event? The correct answer is the third one: the probability is 1/6. Why is that? Well, first let's be precise about what this event is. Which outcomes have the two dice summing to seven? There are exactly six such outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1). Each of the 36 outcomes is equally likely, with probability 1/36, so we have six members of the event, each contributing 1/36, and the probability of the event is one sixth.
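To make the "add up the outcomes in the event" recipe concrete, here's a small sketch (again my own code, not from the lecture) that builds the event "the two dice sum to seven" and computes its probability.

```python
from fractions import Fraction

omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = {outcome: Fraction(1, 36) for outcome in omega}

# An event is just a subset of the sample space; here, "the dice sum to 7".
event = [outcome for outcome in omega if sum(outcome) == 7]

# Its probability is the sum of the probabilities of its outcomes.
prob = sum(p[outcome] for outcome in event)
print(len(event), prob)   # 6 outcomes, probability 1/6
```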
Let's move on to the next quiz, which concerns our second running example: the randomly chosen pivot in the outermost call to QuickSort on an input array of length n. Recall that in QuickSort, once you choose a pivot, you partition the array around it, and this splits the input array into two subarrays: a left one, with the elements less than the pivot, and a right one, with those bigger than the pivot. The more balanced the split into these two subproblems, the better; ideally we'd like a 50/50 split. So what this quiz asks is: what fraction of pivots, that is, with what probability does a randomly chosen pivot, give you a reasonably good split, meaning both of the subproblems have size at least 25%? That is, you get a 25/75 split or better.

The correct answer to this quiz is again the third option: there's a 50 percent probability that you get a 25/75 split or better. To see why, let's again be precise about which event we're talking about, and then compute its probability. When does a pivot give you a 25/75 split or better? For concreteness, suppose the array contains just the integers 1 through 100. What's the property we want? We want both subarrays to have at least 25 percent of the elements, so that neither one has more than 75 percent. Well, if we choose an element with value 26 or bigger as the pivot, then the left subproblem has at least 25 elements, namely the numbers 1 through 25. And if we choose an element that's at most 75, then the right subarray has at least 25 elements, namely the numbers 76 through 100. So any pivot between 26 and 75, inclusive, gives us a 25/75 split or better. More generally, any pivot from the middle 50 percent of the values gives us the desired split: we do badly if we pick something from the smallest quarter, we do badly if we pick something from the largest quarter, and anything in the middle works. More formally, the event S we're analyzing is the set of pivot choices that lie in neither the first quarter nor the last quarter. The cardinality of this set, the number of pivots in it, is essentially half of all pivot choices (I'm ignoring fractions here for simplicity). So the probability of the event is the cardinality of this set times the probability of each individual outcome, and since we choose the pivot uniformly at random, each outcome has probability 1/n. You get n/2 divided by n, or one half.

Now that we've explored the concept of events in our two running examples, we know that the probability that the sum of two dice equals seven is one sixth, a useful fact if you're ever playing craps. And we know that a pivot gives us a 25/75 split or better in randomized QuickSort with 50 percent probability, a useful fact if you want to develop intuition for why QuickSort is, in fact, quick. So that's events.
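Here is the same counting argument as a quick sketch in code (mine, not from the lecture), for an array of n distinct elements; prob_balanced_split is just an illustrative helper name.

```python
from fractions import Fraction

def prob_balanced_split(n):
    # A pivot of rank r leaves r - 1 elements in the left subarray and
    # n - r in the right one; we want both sides to have at least n/4.
    good = [r for r in range(1, n + 1) if r - 1 >= n / 4 and n - r >= n / 4]
    # Each of the n possible pivots is chosen with probability 1/n.
    return Fraction(len(good), n)

print(prob_balanced_split(100))   # 1/2, matching the quiz answer
```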
Let's move on to random variables. A random variable is basically a statistic measuring what happens in the random outcome. Formally, it's a real-valued function defined on the sample space Ω: given an outcome, a realization of the randomness, it gives you back a number. The random variable we most often care about in algorithm design is the running time of a randomized algorithm; that's the case, for example, with the QuickSort algorithm. Notice that the running time is indeed a random variable: if we know the state of the world, that is, the outcome of all of the coin flips our code is going to make, then there's just some running time of our algorithm. Given the outcomes of the coin flips, out pops a number, the running time, say in milliseconds.

Here are a couple of more modest examples of random variables in our two running examples. If we're rolling two dice, one very simple random variable takes as input the outcome, the result of the two dice, and spits out their sum. That's certainly a random variable; on any given outcome it takes on some integer value between two, at the minimum, and twelve, at the maximum. Our second running example is the randomly chosen pivot made by the outermost call to QuickSort. Let's think about the random variable which is the size, meaning the subarray length, passed to the first recursive call. Equivalently, this random variable is the number of elements of the input array smaller than the randomly chosen pivot. So this is a random variable taking on some integer value between zero at the smallest, if we happen to pick the minimum of the array as the pivot, and n-1 at the largest, if we happen to pick the maximum element as the pivot.

Next, let's talk about the expectation of a random variable. This is really nothing more than the average value, and of course when you take the average of some statistic, you want to weight it by the probabilities of its various values. Let's make that precise. Consider some random variable, capital X. Its expectation, also called the expected value and written E[X], is, in English, just the average value, naturally weighted by the probabilities of the various possible outcomes. More mathematically, we sum over everything that could happen, with little i denoting one possible outcome, and take the value of the random variable when that outcome occurs, times the probability of that outcome occurring: E[X] = sum over i in Ω of X(i) * p(i).

The next two quizzes ask you to compute the expectations of the two random variables we identified on the previous slide. The first quiz is about the two dice and the random variable which is the sum of their values. What is the average value of that random variable, that is, its expectation? The answer is the second option: the average value is seven. There are a bunch of different ways to see that. In my opinion, the best way to compute this is using linearity of expectation, which is the next concept we're going to cover. If you wanted to, you could compute it by brute force, by which I mean iterating over all 36 possible outcomes, looking at the value of the two dice in each, and evaluating the sum from the definition on the last slide. A slightly sneakier way, if you don't yet know linearity of expectation, is to pair up the outcomes: the sum of the two dice is equally likely to be two or twelve, equally likely to be three or eleven, four or ten, and so on. Each such pair of values adds up to fourteen, so when you average, you get seven. But again, the right way to do this is linearity of expectation, which we'll cover next.
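For reference, here's the brute-force computation mentioned above written out as a short sketch (my own code), iterating over all 36 outcomes exactly as in the definition of expectation.

```python
from fractions import Fraction

omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = Fraction(1, 36)   # uniform probability of each outcome

# E[sum of the two dice] = sum over outcomes of (value) * (probability).
expected_sum = sum((d1 + d2) * p for d1, d2 in omega)
print(expected_sum)   # 7, as in the quiz
```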
The second quiz covers the second random variable we identified. Now we're back to QuickSort and the random pivot chosen in the outermost call, and the question is: how big, on average, that is, in expectation, is the subarray passed to the first recursive call? Equivalently, on average, how many elements are going to be less than the randomly chosen pivot? The correct answer to this quiz is the third option. In fact, it's exactly (n-1)/2, not n/2, but basically half of the elements.

Again, there's a sneaky way to see this, if you want: the two recursive calls are clearly symmetric, so the expected size of the left recursive call is the same as the expected size of the right recursive call. The two recursive calls together always comprise n-1 of the elements, so by symmetry you expect half in each, that is, (n-1)/2 in each. But for this problem, I think it's perfectly fine to just compute the answer using the definition of expectation. If we let X denote the random variable we care about, the subarray size, then we can compute directly by summing over all of the possible outcomes, all of the possible choices of the pivot. With probability 1/n we choose the minimum element as the pivot, resulting in zero elements being passed to the first recursive call. With probability 1/n we pick the second smallest element, resulting in one element being passed to the first recursive call. With probability 1/n we pick the third smallest, giving a subarray size of two, and so on, until, with probability 1/n, we pick the maximum element, giving a subarray size of n-1. So E[X] = (1/n) * (0 + 1 + 2 + ... + (n-1)), and if you compute this sum out, you get, as expected, (n-1)/2.

Expectation is the last definition I'm going to give you in this part one of the probability review. Next is our fifth and final concept for this video, linearity of expectation. That's not a definition; it's more of a theorem. So what is linearity of expectation? It's a very simple property of random variables that is super, super important; it comes up all the time when we analyze randomized algorithms and random processes more generally. It's the following very simple claim, which I'll sometimes abbreviate as Lin Exp. Suppose you've got a bunch of random variables defined on the same sample space. If you want the expected value of the sum of these random variables, it doesn't matter whether you take the sum first and then the expectation, or the expectations first and then the sum. That is, the expected value of a sum of random variables is equal to the sum of the expectations of the individual random variables. One of the reasons linearity of expectation is so ubiquitously useful is that it always works, no matter what these random variables are; in particular, it holds even when the random variables are not independent. Now, I haven't defined independent random variables yet; that will come in part two of the probability review, but hopefully you have an intuitive sense of what independence means: things are independent if knowing something about one of the random variables doesn't influence what you expect from the other.

I realize that the first time you see linearity of expectation, it's a little hard to appreciate. First of all, as far as applications, we'll see plenty throughout this course; pretty much every single application of probability that we'll see involves linearity of expectation in the analysis. It may also be hard to appreciate why this is not a tautology, since symbolically it may look like it has to be true. But to point out that there is content here: if I replaced the sums by products, then the analogous equation would in general be false when the random variables are not independent. So the same thing is not true about products. It's really about sums.
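Here's a tiny sketch (my own, not from the lecture) of that caveat: take X to be a fair die and Y to be a second random variable defined to equal X, so the two are as dependent as possible. Linearity still gives the right answer for the sum, but the analogous statement for products fails.

```python
from fractions import Fraction

die = range(1, 7)
p = Fraction(1, 6)

E_X = sum(v * p for v in die)            # E[X] = 7/2
E_sum = sum((v + v) * p for v in die)    # E[X + Y] with Y = X
E_prod = sum((v * v) * p for v in die)   # E[X * Y] = E[X^2]

print(E_sum == E_X + E_X)     # True: linearity holds despite dependence
print(E_prod == E_X * E_X)    # False: 91/6 is not 49/4
```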
Let me give you a trivial illustration of linearity of expectation and point out how easily it allows us to evaluate the expected sum of two dice. In our first running example, let's introduce two random variables, X1 and X2, for the results of the first and second die, respectively. Computing the expected value of a single die is easy: there are only six values to enumerate over, in contrast with the 36 outcomes we enumerated when we evaluated the sum of the two dice directly. The average value of a single die, you won't be surprised to hear, is 3.5: it ranges over the integers one through six uniformly, so 3.5 on average. And now, using linearity of expectation, the expected sum of two dice is simply double the average value of a single one, namely seven.

On the next slide, I'm going to prove this property, prove linearity of expectation. Frankly, the proof is pretty trivial, so if you don't care about the proof, that's fine; you can skip it without loss. I'm including it just for completeness. And I have to say, I don't know of another mathematical statement that is simultaneously so trivial to prove and so unbelievably useful. There's really something remarkable about linearity of expectation. So how does the proof go? Honestly, we just write out the definition of expectation, reverse the order of the sums, and we're done. Let me start with the right-hand side of the equation, the sum of the expectations of the random variables. Now let's just apply the definition of expectation: each expectation is a weighted average over the possible outcomes. And now, instead of summing first over the random variable index j and then over the realized outcome i, I'm going to do it in the reverse order: sum first over the outcome i and then over the random variable index j. The probability of outcome i does not depend on j, so we can yank p(i) outside of that inner sum. But now what have we got? Inside the parentheses, we simply have the value of the sum of the Xj's on the outcome i, and we are averaging the sum of the Xj's with respect to the probabilities p(i). So this is just the definition of the expectation of the sum of the random variables; the full chain of equalities is written out below. So that's it: linearity of expectation is really just a reversal of a double sum.

Now, for those of you who are rusty on these kinds of manipulations, I just want to point out that this reversal of the double sum is nothing complicated at all. If you want a really pedestrian way to think about what's happening, just imagine that we take these summands and write them out in a grid, where the columns are indexed by the random variable j and the rows are indexed by the outcome i, and in a given cell of the grid we write the summand Xj(i) times p(i). The point is that both of the double sums, the first one and the second one, are just the sum of all of the entries in this grid. In the first sum, we first group things according to the rows, take row sums, and then add them up; in the second sum, we group things according to the columns, take column sums, and then add up the column sums. Either way, it's just the sum of the entries in the grid, so it doesn't matter which way you sum. Okay? So reversal of double sums is a trivial thing.
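For reference, here is the chain of equalities just described, written out in symbols, with p(i) the probability of outcome i and X_j(i) the value of the j-th random variable on that outcome.

```latex
\sum_{j=1}^{n} E[X_j]
  = \sum_{j=1}^{n} \sum_{i \in \Omega} X_j(i)\, p(i)   % definition of expectation
  = \sum_{i \in \Omega} \sum_{j=1}^{n} X_j(i)\, p(i)   % reverse the order of summation
  = \sum_{i \in \Omega} p(i) \sum_{j=1}^{n} X_j(i)     % p(i) does not depend on j
  = E\left[ \sum_{j=1}^{n} X_j \right]                 % definition of expectation again
```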
So if you get lost in the notation with these double sums, the point is, you can just interpret each of them in terms of this grid. Both of these double sums are nothing more than the sum of the values in all of the cells of the grid. One order of summation says you first take row sums and then add those up; the other says you first take column sums and then add those up. But of course it doesn't matter; either way you just get the sum of everything in the grid. So there are no tricks up my sleeve when I reverse these sums; it's a totally elementary, trivial thing. So, again: linearity of expectation, trivial to prove, incredibly useful, don't forget it.

I want to conclude this video with one final example in order to tie together all of the concepts we just learned, or just reviewed. It's going to be an example of load balancing: assigning processes to servers. This will in fact be quite relevant to the analysis of hashing that we're going to see toward the end of the course as well, but for now let's just think about the following simple problem. For some integer n, you have n computer processes that have to be assigned to n servers in some way. Now, you're feeling really lazy, so you're just going to assign each process to a totally random server, with each server equally likely to receive a given process. The question I want to study is: does this laziness cost you, at least on average? If you look at a server, what is its expected load?

Let's proceed to the solution, the answer to this question. Before we start talking about expectations, we have to be clear about the sample space and the probabilities of the various outcomes. Remember, the sample space Ω just denotes every possible thing that can happen. So what are we doing? For each process, we assign it to a random server, so all of the things that can happen are all of the different assignments of these n processes to these n servers. If you think about it, there are n^n possible outcomes, because you have n choices for each of the n processes. Moreover, because each process is assigned to one of the servers uniformly at random, each of these n^n assignments is equally likely, with probability 1/n^n. Now that we have a sample space, we're in a position to define a random variable, and we already know which random variable we care about: the load of a server. All of the servers are exactly the same, so we can just focus on one server, say the first server, and let Y be the number of processes assigned to it. Going back to the problem statement, what we're asked to compute is the expected value of Y, the expected number of processes assigned to this server.

Now, of course, in principle we could go to the definition of expectation and compute by brute force: sum over all possible outcomes the value of Y and take the weighted average. Unfortunately, there are n^n different outcomes, and that's a lot. So what can we do other than this brute-force computation? Well, recall our example of linearity of expectation and the sum of two dice. We observed that instead of computing the expected sum by enumerating all 36 outcomes, it was much better to focus on a single die, compute its expectation, and then conclude with linearity of expectation.
So we'll do the same thing here: instead of focusing directly on the sum Y, we'll focus on constituent parts of Y, namely whether or not a single process gets assigned to the first server, and then we'll finish the job with linearity of expectation. More precisely, for a given process j, let's define Xj to be 1 if and only if the j-th process gets assigned to the first server, and 0 otherwise. Zero-one random variables like Xj are often called indicator random variables, because they, in effect, indicate whether or not a certain event occurs; in this case, whether or not the j-th process gets assigned to the first server. Why did I make this definition? Well, observe that the total number of processes assigned to the first server is simply the sum, from j = 1 to n, of the Xj's: Xj says whether or not the j-th process is on the first server, and the total number is the sum of these over all j.

The benefit of this maneuver is that we only need to compute the expectation of an extremely simple indicator random variable Xj. This is like the win we got when summing two dice: instead of having to compute the expected value of the sum directly, we just had to focus on the expectation of a single die, which was really easy. Similarly, here the expectation of a single Xj is really easy. Specifically, let's write it out using the definition of expectation. What is the expected value of Xj? Well, let's group together all the outcomes in which it takes on the value zero; the contribution to the expectation is zero for all of those outcomes. Then there are the rest of the outcomes, where Xj takes on the value one, and in those cases it contributes one to the expectation. So we get some happy cancellation from the zero part, and all we have to worry about is the probability that Xj takes on the value one: E[Xj] = Pr[Xj = 1]. Okay, what was Xj again? How did we define it? Remember, it's one exactly when the j-th process gets assigned to the first server. And how are processes assigned? Remember, the proposed solution assigns each process to each of the n servers with equal probability. So the probability that the j-th process gets assigned to the first server is 1/n. This leaves us with just the sum, from j = 1 to n, of 1/n; that is, we sum 1/n with itself n times, which of course is equal to one. So, summarizing: the expected number of processes assigned to a given server, say the first one, is exactly one.
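As a quick sanity check (a sketch of my own, not part of the lecture), here's a small simulation of the lazy assignment: each of n processes picks one of the n servers uniformly at random, and we measure the average load of the first server over many trials. The answer hovers around 1, matching the exact calculation.

```python
import random

def average_load_of_first_server(n, trials=100_000):
    total = 0
    for _ in range(trials):
        # Each process independently picks one of the n servers at random.
        assignment = [random.randrange(n) for _ in range(n)]
        total += assignment.count(0)   # Y = load of the first server
    return total / trials

print(average_load_of_first_server(n=20))   # roughly 1.0
```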
So, at least if we only care about averages, we lose very little from this trivial strategy of randomly spraying the processes across the servers: on average, any given server has just one process on it. This is characteristic of the role that randomization plays in algorithm design, and in computer science more generally. Often we can get away with really simple heuristics just by making random choices. Of course, QuickSort is one example of that, where we get an extremely practical, widely used sorting algorithm, with an average running time of O(n log n), just by choosing pivots at random in every recursive call.