Welcome to part one of our probability review. The first time we need these concepts in this course is for those who want to understand the analysis of QuickSort: why it runs in O(n log n) time on average. These topics will also come up a couple of other times in the course, for example when we study a randomized algorithm for the minimum cut problem in graphs, and when we try to understand the performance of hashing.

Here are the topics we're going to cover. We'll start at the beginning with sample spaces, and then we'll discuss events and their probabilities. We'll talk about random variables, which are real-valued functions on a sample space, and about expectation, which is basically the average value of a random variable. We'll identify and prove a very important property called linearity of expectation, which will come up over and over again in our analyses of randomized processes. Those are the topics for part one, and we'll conclude the video with one example tying these concepts together: load balancing.

This video is by no means the only source you can turn to to learn about these concepts. A couple of other sources I recommend are the online lecture notes by Eric [inaudible] and Tom [inaudible], and there is also a wikibook on discrete probability you could check out. And I want to be clear: this is really not meant to be a course or a tutorial on probability concepts. It is only meant to be a refresher, so I'm going to go at a reasonably fast pace and give a pretty cursory presentation. If you want a more thorough review, I'd check out one of these other sources, or your favorite book on discrete probability. Along the same lines, many of you have seen some of this material before. Don't feel compelled to watch this video straight through from beginning to end; feel free to just dip in and review the concepts you need a refresher on.

So, let's start at the beginning with sample spaces. What is a sample space? Well, we're analyzing random processes, so any number of things can happen, and the sample space is just the collection of all of the things that could happen. This is the universe in which we're going to discuss probabilities and average values. I'll use the notation capital omega (Ω) for the sample space. One thing we have going for us in the design and analysis of algorithms is that we can typically take Ω to be a finite set, so we are dealing only with discrete probability, which is a very happy thing, because it's much more elementary than general probability.

Now, I realize this is a pretty abstract concept, and the next few definitions will also be abstract, so I'm going to use two very simple, very concrete running examples to illustrate them. The first example: we take two six-sided dice and roll them. What's the sample space? It's just all of the possible outcomes of the two dice, 36 of them. In addition to defining the outcomes, everything that could possibly happen, we also need to define the probability of each individual outcome. Of course, each probability should be non-negative, and there's the obvious constraint that the probabilities should sum to one: exactly one thing is going to happen. In the dice example, assuming each die is well crafted, we expect each of the 36 outcomes to be equally likely, each occurring with probability 1/36.
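Here's a tiny sketch of these definitions in code (my own, not part of the lecture): the sample space for the two-dice example, with each outcome given the uniform probability 1/36, and the two constraints on the probabilities checked explicitly. The names omega and p are just illustrative.

```python
from fractions import Fraction

# Sample space for rolling two six-sided dice: all 36 ordered outcomes.
omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

# Probability of each individual outcome (uniform, since the dice are fair).
p = {outcome: Fraction(1, 36) for outcome in omega}

assert all(prob >= 0 for prob in p.values())  # probabilities are non-negative
assert sum(p.values()) == 1                   # probabilities sum to exactly one
print(len(omega))                             # 36
```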
The second running example I'm going to use is more directly related to algorithms, and it's motivated by the QuickSort algorithm. Recall that we're studying the implementation of QuickSort that chooses a pivot uniformly at random in every recursive call. Let's focus on the very first, outermost call of QuickSort and think about the random choice of the pivot just in that call. The sample space, all of the different things that could happen, is just the n different choices of pivot, assuming the array has length n. So we can represent the sample space as the integers 1, 2, up to n, corresponding to the array index of the randomly chosen pivot. And by the construction of our code, each of these outcomes is equally likely, with probability 1/n.

Now let's talk about events. An event is nothing more than a subset of the things that could happen; that is, a subset of the sample space Ω. The probability of an event is exactly what you'd think it would be: the sum of the probabilities of all of the outcomes contained in that event. An event is just a bunch of things that might happen; we know the probability of each individual outcome, and we add them up to get the probability of the event.

The next two quizzes are meant to give you some practice with these concepts, and in particular with computing the probabilities of events in our two running examples. The first quiz concerns our first running example, where we roll two dice and have 36 possible outcomes. Consider the subset of outcomes in which the sum of the two dice equals seven. What is the probability of that event? The correct answer is the third one: the probability is 1/6. Why is that? Well, first let's be precise about what this event is. Which outcomes have the two dice summing to seven? There are exactly six such outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1). Each of the 36 outcomes is equally likely, with probability 1/36, so we have six members of the event, each contributing 1/36, and the probability of the event is one sixth.
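To make the "add up the outcomes in the event" recipe concrete, here's a small sketch (again my own code, not from the lecture) that builds the event "the two dice sum to seven" and computes its probability.

```python
from fractions import Fraction

omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = {outcome: Fraction(1, 36) for outcome in omega}

# An event is just a subset of the sample space; here, "the dice sum to 7".
event = [outcome for outcome in omega if sum(outcome) == 7]

# Its probability is the sum of the probabilities of its outcomes.
prob = sum(p[outcome] for outcome in event)
print(len(event), prob)   # 6 outcomes, probability 1/6
```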
Let's move on to the next quiz, which concerns our second running example: the randomly chosen pivot in the outermost call to QuickSort on an input array of length n. Recall that in QuickSort, once you choose a pivot, you partition the array around it, and this splits the input array into two subarrays: a left one, with the elements less than the pivot, and a right one, with those bigger than the pivot. The more balanced the split into these two subproblems, the better; ideally we'd like a 50/50 split. So what this quiz asks is: what fraction of pivots, that is, with what probability does a randomly chosen pivot, give you a reasonably good split, meaning both of the subproblems have size at least 25%? That is, you get a 25/75 split or better.

The correct answer to this quiz is again the third option: there's a 50 percent probability that you get a 25/75 split or better. To see why, let's again be precise about which event we're talking about, and then compute its probability. When does a pivot give you a 25/75 split or better? For concreteness, suppose the array contains just the integers 1 through 100. What's the property we want? We want both subarrays to have at least 25 percent of the elements, so that neither one has more than 75 percent. Well, if we choose an element with value 26 or bigger as the pivot, then the left subproblem has at least 25 elements, namely the numbers 1 through 25. And if we choose an element that's at most 75, then the right subarray has at least 25 elements, namely the numbers 76 through 100. So any pivot between 26 and 75, inclusive, gives us a 25/75 split or better. More generally, any pivot from the middle 50 percent of the values gives us the desired split: we do badly if we pick something from the smallest quarter, we do badly if we pick something from the largest quarter, and anything in the middle works. More formally, the event S we're analyzing is the set of pivot choices that lie in neither the first quarter nor the last quarter. The cardinality of this set, the number of pivots in it, is essentially half of all pivot choices (I'm ignoring fractions here for simplicity). So the probability of the event is the cardinality of this set times the probability of each individual outcome, and since we choose the pivot uniformly at random, each outcome has probability 1/n. You get n/2 divided by n, or one half.

Now that we've explored the concept of events in our two running examples, we know that the probability that the sum of two dice equals seven is one sixth, a useful fact if you're ever playing craps. And we know that a pivot gives us a 25/75 split or better in randomized QuickSort with 50 percent probability, a useful fact if you want to develop intuition for why QuickSort is, in fact, quick. So that's events.
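Here is the same counting argument as a quick sketch in code (mine, not from the lecture), for an array of n distinct elements; prob_balanced_split is just an illustrative helper name.

```python
from fractions import Fraction

def prob_balanced_split(n):
    # A pivot of rank r leaves r - 1 elements in the left subarray and
    # n - r in the right one; we want both sides to have at least n/4.
    good = [r for r in range(1, n + 1) if r - 1 >= n / 4 and n - r >= n / 4]
    # Each of the n possible pivots is chosen with probability 1/n.
    return Fraction(len(good), n)

print(prob_balanced_split(100))   # 1/2, matching the quiz answer
```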
Let's move on to random variables. A random variable is basically a statistic measuring what happens in the random outcome. Formally, it's a real-valued function defined on the sample space Ω: given an outcome, a realization of the randomness, it gives you back a number. The random variable we most often care about in algorithm design is the running time of a randomized algorithm; that's the case, for example, with the QuickSort algorithm. Notice that the running time is indeed a random variable: if we know the state of the world, that is, the outcome of all of the coin flips our code is going to make, then there's just some running time of our algorithm. Given the outcomes of the coin flips, out pops a number, the running time, say in milliseconds.

Here are a couple of more modest examples of random variables in our two running examples. If we're rolling two dice, one very simple random variable takes as input the outcome, the result of the two dice, and spits out their sum. That's certainly a random variable; on any given outcome it takes on some integer value between two, at the minimum, and twelve, at the maximum. Our second running example is the randomly chosen pivot made by the outermost call to QuickSort. Let's think about the random variable which is the size, meaning the subarray length, passed to the first recursive call. Equivalently, this random variable is the number of elements of the input array smaller than the randomly chosen pivot. So this is a random variable taking on some integer value between zero at the smallest, if we happen to pick the minimum of the array as the pivot, and n-1 at the largest, if we happen to pick the maximum element as the pivot.

Next, let's talk about the expectation of a random variable. This is really nothing more than the average value, and of course when you take the average of some statistic, you want to weight it by the probabilities of its various values. Let's make that precise. Consider some random variable, capital X. Its expectation, also called the expected value and written E[X], is, in English, just the average value, naturally weighted by the probabilities of the various possible outcomes. More mathematically, we sum over everything that could happen, with little i denoting one possible outcome, and take the value of the random variable when that outcome occurs, times the probability of that outcome occurring: E[X] = sum over i in Ω of X(i) * p(i).

The next two quizzes ask you to compute the expectations of the two random variables we identified on the previous slide. The first quiz is about the two dice and the random variable which is the sum of their values. What is the average value of that random variable, that is, its expectation? The answer is the second option: the average value is seven. There are a bunch of different ways to see that. In my opinion, the best way to compute this is using linearity of expectation, which is the next concept we're going to cover. If you wanted to, you could compute it by brute force, by which I mean iterating over all 36 possible outcomes, looking at the value of the two dice in each, and evaluating the sum from the definition on the last slide. A slightly sneakier way, if you don't yet know linearity of expectation, is to pair up the outcomes: the sum of the two dice is equally likely to be two or twelve, equally likely to be three or eleven, four or ten, and so on. Each such pair of values adds up to fourteen, so when you average, you get seven. But again, the right way to do this is linearity of expectation, which we'll cover next.
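For reference, here's the brute-force computation mentioned above written out as a short sketch (my own code), iterating over all 36 outcomes exactly as in the definition of expectation.

```python
from fractions import Fraction

omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = Fraction(1, 36)   # uniform probability of each outcome

# E[sum of the two dice] = sum over outcomes of (value) * (probability).
expected_sum = sum((d1 + d2) * p for d1, d2 in omega)
print(expected_sum)   # 7, as in the quiz
```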
The second quiz covers the second random variable we identified. Now we're back to QuickSort and the random pivot chosen in the outermost call, and the question is: how big, on average, that is, in expectation, is the subarray passed to the first recursive call? Equivalently, on average, how many elements are going to be less than the randomly chosen pivot? The correct answer to this quiz is the third option. In fact, it's exactly (n-1)/2, not n/2, but basically half of the elements.

Again, there's a sneaky way to see this, if you want: the two recursive calls are clearly symmetric, so the expected size of the left recursive call is the same as the expected size of the right recursive call. The two recursive calls together always comprise n-1 of the elements, so by symmetry you expect half in each, that is, (n-1)/2 in each. But for this problem, I think it's perfectly fine to just compute the answer using the definition of expectation. If we let X denote the random variable we care about, the subarray size, then we can compute directly by summing over all of the possible outcomes, all of the possible choices of the pivot. With probability 1/n we choose the minimum element as the pivot, resulting in zero elements being passed to the first recursive call. With probability 1/n we pick the second smallest element, resulting in one element being passed to the first recursive call. With probability 1/n we pick the third smallest, giving a subarray size of two, and so on, until, with probability 1/n, we pick the maximum element, giving a subarray size of n-1. So E[X] = (1/n) * (0 + 1 + 2 + ... + (n-1)), and if you compute this sum out, you get, as expected, (n-1)/2.

Expectation is the last definition I'm going to give you in this part one of the probability review. Next is our fifth and final concept for this video, linearity of expectation. That's not a definition; it's more of a theorem. So what is linearity of expectation? It's a very simple property of random variables that is super, super important; it comes up all the time when we analyze randomized algorithms and random processes more generally. It's the following very simple claim, which I'll sometimes abbreviate as Lin Exp. Suppose you've got a bunch of random variables defined on the same sample space. If you want the expected value of the sum of these random variables, it doesn't matter whether you take the sum first and then the expectation, or the expectations first and then the sum. That is, the expected value of a sum of random variables is equal to the sum of the expectations of the individual random variables. One of the reasons linearity of expectation is so ubiquitously useful is that it always works, no matter what these random variables are; in particular, it holds even when the random variables are not independent. Now, I haven't defined independent random variables yet; that will come in part two of the probability review, but hopefully you have an intuitive sense of what independence means: things are independent if knowing something about one of the random variables doesn't influence what you expect from the other.

I realize that the first time you see linearity of expectation, it's a little hard to appreciate. First of all, as far as applications, we'll see plenty throughout this course; pretty much every single application of probability that we'll see involves linearity of expectation in the analysis. It may also be hard to appreciate why this is not a tautology, since symbolically it may look like it has to be true. But to point out that there is content here: if I replaced the sums by products, then the analogous equation would in general be false when the random variables are not independent. So the same thing is not true about products. It's really about sums.
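Here's a tiny sketch (my own, not from the lecture) of that caveat: take X to be a fair die and Y to be a second random variable defined to equal X, so the two are as dependent as possible. Linearity still gives the right answer for the sum, but the analogous statement for products fails.

```python
from fractions import Fraction

die = range(1, 7)
p = Fraction(1, 6)

E_X = sum(v * p for v in die)            # E[X] = 7/2
E_sum = sum((v + v) * p for v in die)    # E[X + Y] with Y = X
E_prod = sum((v * v) * p for v in die)   # E[X * Y] = E[X^2]

print(E_sum == E_X + E_X)     # True: linearity holds despite dependence
print(E_prod == E_X * E_X)    # False: 91/6 is not 49/4
```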
Let me give you a trivial illustration of linearity of expectation and point out how easily it allows us to evaluate the expected sum of two dice. In our first running example, let's introduce two random variables, X1 and X2, for the results of the first and second die, respectively. Computing the expected value of a single die is easy: there are only six values to enumerate over, in contrast with the 36 outcomes we enumerated when we evaluated the sum of the two dice directly. The average value of a single die, you won't be surprised to hear, is 3.5: it ranges over the integers one through six uniformly, so 3.5 on average. And now, using linearity of expectation, the expected sum of two dice is simply double the average value of a single one, namely seven.

On the next slide, I'm going to prove this property, prove linearity of expectation. Frankly, the proof is pretty trivial, so if you don't care about the proof, that's fine; you can skip it without loss. I'm including it just for completeness. And I have to say, I don't know of another mathematical statement that is simultaneously so trivial to prove and so unbelievably useful. There's really something remarkable about linearity of expectation. So how does the proof go? Honestly, we just write out the definition of expectation, reverse the order of the sums, and we're done. Let me start with the right-hand side of the equation, the sum of the expectations of the random variables. Now let's just apply the definition of expectation: each expectation is a weighted average over the possible outcomes. And now, instead of summing first over the random variable index j and then over the realized outcome i, I'm going to do it in the reverse order: sum first over the outcome i and then over the random variable index j. The probability of outcome i does not depend on j, so we can yank p(i) outside of that inner sum. But now what have we got? Inside the parentheses, we simply have the value of the sum of the Xj's on the outcome i, and we are averaging the sum of the Xj's with respect to the probabilities p(i). So this is just the definition of the expectation of the sum of the random variables; the full chain of equalities is written out below. So that's it: linearity of expectation is really just a reversal of a double sum.

Now, for those of you who are rusty on these kinds of manipulations, I just want to point out that this reversal of the double sum is nothing complicated at all. If you want a really pedestrian way to think about what's happening, just imagine that we take these summands and write them out in a grid, where the columns are indexed by the random variable j and the rows are indexed by the outcome i, and in a given cell of the grid we write the summand Xj(i) times p(i). The point is that both of the double sums, the first one and the second one, are just the sum of all of the entries in this grid. In the first sum, we first group things according to the rows, take row sums, and then add them up; in the second sum, we group things according to the columns, take column sums, and then add up the column sums. Either way, it's just the sum of the entries in the grid, so it doesn't matter which way you sum. Okay? So reversal of double sums is a trivial thing.
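For reference, here is the chain of equalities just described, written out in symbols, with p(i) the probability of outcome i and X_j(i) the value of the j-th random variable on that outcome.

```latex
\sum_{j=1}^{n} E[X_j]
  = \sum_{j=1}^{n} \sum_{i \in \Omega} X_j(i)\, p(i)   % definition of expectation
  = \sum_{i \in \Omega} \sum_{j=1}^{n} X_j(i)\, p(i)   % reverse the order of summation
  = \sum_{i \in \Omega} p(i) \sum_{j=1}^{n} X_j(i)     % p(i) does not depend on j
  = E\left[ \sum_{j=1}^{n} X_j \right]                 % definition of expectation again
```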
So if you get lost in the notation with these double sums, the point is, you can just interpret each of them in terms of this grid. Both of these double sums are nothing more than the sum of the values in all of the cells of the grid. One order of summation says you first take row sums and then add those up; the other says you first take column sums and then add those up. But of course it doesn't matter; either way you just get the sum of everything in the grid. So there are no tricks up my sleeve when I reverse these sums; it's a totally elementary, trivial thing. So, again: linearity of expectation, trivial to prove, incredibly useful, don't forget it.

I want to conclude this video with one final example in order to tie together all of the concepts we just learned, or just reviewed. It's going to be an example of load balancing: assigning processes to servers. This will in fact be quite relevant to the analysis of hashing that we're going to see toward the end of the course as well, but for now let's just think about the following simple problem. For some integer n, you have n computer processes that have to be assigned to n servers in some way. Now, you're feeling really lazy, so you're just going to assign each process to a totally random server, with each server equally likely to receive a given process. The question I want to study is: does this laziness cost you, at least on average? If you look at a server, what is its expected load?

Let's proceed to the solution, the answer to this question. Before we start talking about expectations, we have to be clear about the sample space and the probabilities of the various outcomes. Remember, the sample space Ω just denotes every possible thing that can happen. So what are we doing? For each process, we assign it to a random server, so all of the things that can happen are all of the different assignments of these n processes to these n servers. If you think about it, there are n^n possible outcomes, because you have n choices for each of the n processes. Moreover, because each process is assigned to one of the servers uniformly at random, each of these n^n assignments is equally likely, with probability 1/n^n. Now that we have a sample space, we're in a position to define a random variable, and we already know which random variable we care about: the load of a server. All of the servers are exactly the same, so we can just focus on one server, say the first server, and let Y be the number of processes assigned to it. Going back to the problem statement, what we're asked to compute is the expected value of Y, the expected number of processes assigned to this server.

Now, of course, in principle we could go to the definition of expectation and compute by brute force: sum over all possible outcomes the value of Y and take the weighted average. Unfortunately, there are n^n different outcomes, and that's a lot. So what can we do other than this brute-force computation? Well, recall our example of linearity of expectation and the sum of two dice. We observed that instead of computing the expected sum by enumerating all 36 outcomes, it was much better to focus on a single die, compute its expectation, and then conclude with linearity of expectation.
So we'll do the same thing here: instead of focusing directly on the sum Y, we'll focus on constituent parts of Y, namely whether or not a single process gets assigned to the first server, and then we'll finish the job with linearity of expectation. More precisely, for a given process j, let's define Xj to be 1 if and only if the j-th process gets assigned to the first server, and 0 otherwise. Zero-one random variables like Xj are often called indicator random variables, because they, in effect, indicate whether or not a certain event occurs; in this case, whether or not the j-th process gets assigned to the first server. Why did I make this definition? Well, observe that the total number of processes assigned to the first server is simply the sum, from j = 1 to n, of the Xj's: Xj says whether or not the j-th process is on the first server, and the total number is the sum of these over all j.

The benefit of this maneuver is that we only need to compute the expectation of an extremely simple indicator random variable Xj. This is like the win we got when summing two dice: instead of having to compute the expected value of the sum directly, we just had to focus on the expectation of a single die, which was really easy. Similarly, here the expectation of a single Xj is really easy. Specifically, let's write it out using the definition of expectation. What is the expected value of Xj? Well, let's group together all the outcomes in which it takes on the value zero; the contribution to the expectation is zero for all of those outcomes. Then there are the rest of the outcomes, where Xj takes on the value one, and in those cases it contributes one to the expectation. So we get some happy cancellation from the zero part, and all we have to worry about is the probability that Xj takes on the value one: E[Xj] = Pr[Xj = 1]. Okay, what was Xj again? How did we define it? Remember, it's one exactly when the j-th process gets assigned to the first server. And how are processes assigned? Remember, the proposed solution assigns each process to each of the n servers with equal probability. So the probability that the j-th process gets assigned to the first server is 1/n. This leaves us with just the sum, from j = 1 to n, of 1/n; that is, we sum 1/n with itself n times, which of course is equal to one. So, summarizing: the expected number of processes assigned to a given server, say the first one, is exactly one.
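As a quick sanity check (a sketch of my own, not part of the lecture), here's a small simulation of the lazy assignment: each of n processes picks one of the n servers uniformly at random, and we measure the average load of the first server over many trials. The answer hovers around 1, matching the exact calculation.

```python
import random

def average_load_of_first_server(n, trials=100_000):
    total = 0
    for _ in range(trials):
        # Each process independently picks one of the n servers at random.
        assignment = [random.randrange(n) for _ in range(n)]
        total += assignment.count(0)   # Y = load of the first server
    return total / trials

print(average_load_of_first_server(n=20))   # roughly 1.0
```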
So, at least if we only care about averages, we lose very little from this trivial strategy of randomly spraying the processes across the servers: on average, any given server has just one process on it. This is characteristic of the role that randomization plays in algorithm design, and in computer science more generally. Often we can get away with really simple heuristics just by making random choices. Of course, QuickSort is one example of that, where we get an extremely practical, widely used sorting algorithm, with an average running time of O(n log n), just by choosing pivots at random in every recursive call.