Welcome to part one of our probability review. The first time we need these concepts in this course is for those who want to understand the analysis of QuickSort: why it runs in O(n log n) time on average. These topics will also come up a couple of other times in the course, for example when we study a randomized algorithm for the minimum cut problem in graphs, and also when we try to understand the performance of hashing.

Here are the topics we're going to cover. We'll start at the beginning with sample spaces, and then we'll discuss events and their probabilities. We'll talk about random variables, which are real-valued functions on a sample space. We'll talk about expectation, which is basically the average value of a random variable. We'll identify and prove a very important property, called linearity of expectation, which will come up over and over again in our analyses of randomized processes. Those are the topics for part one, and we'll conclude the video with one example tying these concepts together: load balancing.

This video is by no means the only source you can turn to to learn about these concepts. A couple of other sources I recommend are the online lecture notes by Eric Lehman and Tom Leighton; there's also a wikibook on discrete probability, which you could check out. And I want to be clear: this is really not meant to be a course or a tutorial on probability concepts. It's really only meant to be a refresher, so I'm going to go at a reasonably fast pace, and it's going to be a pretty cursory presentation. If you want a more thorough review, I would check out one of these other sources, or your favorite book on discrete probability. Along the same lines, I'm thinking that many of you have seen some of this material before. Don't feel compelled to watch this video straight through from beginning to end; feel free to just dip in and review the concepts you need a refresher on.

So, let's start at the beginning with sample spaces. What is a sample space? Well, we're analyzing random processes, so any number of things can happen, and the sample space is just the collection of all of those things that could happen. This is basically the universe in which we're going to discuss probabilities and average values. I'll often use the notation Ω (capital omega) to describe the sample space.
One thing that we've got going for us in the design of algorithms is that we can typically take Ω to be a finite set, so we're dealing only with discrete probability. That's a very happy thing, because it's much more elementary than more general probability theory.

Now, I realize this is a super abstract concept, and the next few definitions will also be abstract, so I'm going to use two very simple, very concrete running examples to illustrate the next few concepts. The first example is just going to be: we take two six-sided dice and we roll them. So what's the sample space? Well, it's just all of the possible outcomes of the two dice, that is, the 36 different outcomes these two dice could have.

In addition to defining the outcomes, everything that could possibly happen, we need to define the probability of each individual outcome. Of course, the probability of each outcome should be non-negative, and there's also the obvious constraint that the probabilities should sum to one: exactly one thing is going to happen. Assuming each of the two dice is well crafted, we expect each of the 36 outcomes to be equally likely, occurring with probability 1/36.

The second running example is more directly related to algorithms, and it's motivated by the QuickSort algorithm. Recall that we're studying the implementation of QuickSort that chooses a pivot uniformly at random in every recursive call. Let's just focus on the very first, outermost call of QuickSort and think about the random choice of the pivot in that call. Then the sample space, all of the different things that could happen, is just the n different choices for a pivot, assuming the array has length n. So we can represent the sample space as the integers 1, 2, up to n, corresponding to the array index of the randomly chosen pivot. And by the construction of our code, each of these outcomes is equally likely, with probability 1/n.
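To make this concrete in code, here is a minimal Python sketch of both running examples; the variable names are my own choices for illustration. Each sample space is just an explicit finite list of outcomes paired with a uniform probability:

    # Example 1: rolling two six-sided dice.
    # The sample space is all 36 ordered pairs of die values.
    dice_omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
    dice_prob = {outcome: 1 / 36 for outcome in dice_omega}  # uniform

    # Example 2: the random pivot in QuickSort's outermost call.
    # On an array of length n, the sample space is the n pivot positions.
    n = 100
    pivot_omega = list(range(1, n + 1))
    pivot_prob = {i: 1 / n for i in pivot_omega}  # uniform

    # The two requirements from the text: non-negative, summing to one.
    assert all(p >= 0 for p in dice_prob.values())
    assert abs(sum(dice_prob.values()) - 1.0) < 1e-9
    assert abs(sum(pivot_prob.values()) - 1.0) < 1e-9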
Now let's talk about events. An event is nothing more than a subset of all of the things that could happen; that is, a subset of the sample space Ω. The probability of an event is exactly what you'd think it would be: it's just the sum of the probabilities of all of the outcomes contained in that event. So an event is just a bunch of things that might happen; we know the probability of each individual thing, and we add those up to get the probability of the event.

The next two quizzes are meant to give you some practice with these concepts, and in particular to compute the probabilities of events in our two running examples. The first quiz concerns our first running example, where we roll two dice and have 36 possible outcomes. Consider the subset of outcomes in which the sum of the two dice equals 7. What is the probability of that event?

The correct answer is the third one: the probability is 1/6. Why is that? Well, first let's be more precise about what this event is. What are the outcomes in which the sum of the dice is equal to 7? There are exactly six such outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1). Each of the 36 outcomes is equally likely, with probability 1/36, so we have six members of the event, each with probability 1/36, and the probability of the event is 1/6.
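As a sanity check, a few lines of Python (reusing the dice_omega and dice_prob from the earlier sketch) compute the event's probability exactly the way the definition says, by summing outcome probabilities:

    # The event: outcomes where the two dice sum to 7.
    event = [(d1, d2) for (d1, d2) in dice_omega if d1 + d2 == 7]
    assert len(event) == 6  # (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)

    # Probability of an event = sum of its outcomes' probabilities.
    pr_event = sum(dice_prob[outcome] for outcome in event)
    print(pr_event)  # 0.1666... = 1/6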
Let's move on to the next quiz, which considers our second running example, namely the randomly chosen pivot in the outermost call to QuickSort on an input array of length n. Recall that in QuickSort, when you choose a pivot, you then partition the array around the pivot, and this splits the input array into two subarrays: a left one, with the elements less than the pivot, and a right one, with those bigger than the pivot. The more balanced the split into these two subproblems, the better; ideally, we'd like a 50-50 split. What this quiz asks is: what fraction of pivots, that is, what is the probability that a randomly chosen pivot, will give you a reasonably good split, meaning both of the subproblems have size at least 25 percent? That is, you get a 25-75 split or better.

The correct answer to this quiz is again the third option: there's a 50 percent probability that you get a 25-75 split or better. To see why, let's again be precise about the event we're talking about; then we'll compute its probability. So, when does a pivot give you a 25-75 split or better? For concreteness, suppose the array contains just the integers between 1 and 100. Now, what's the property we want? We want both of the two subarrays to have at least 25 percent of the elements, so that neither one has more than 75 percent of the elements. Well, if we choose an element that's 26 or bigger in value, then the left subproblem will have at least 25 elements, namely the numbers 1 through 25. And if we choose an element that's at most 75, then the right subarray is going to have at least 25 elements, namely the numbers 76 through 100. So anything between 26 and 75, inclusive, is going to give us a 25-75 split. More generally, any pivot from the middle 50 percent of the elements, in sorted order, is going to give us the desired split. We do badly if we get something within the first quarter, we do badly if we get something within the last quarter; anything in the middle works.

More formally, the event S that we're analyzing is, among the n possible pivot choices, the set of pivots that lie neither in the first quarter nor in the last quarter. The cardinality of this set, the number of pivots in it, is essentially half of the overall pivot choices (I'm ignoring fractions here for simplicity). So the probability of the event is the cardinality of S times the probability of each individual outcome. Since we choose the pivot uniformly at random, each outcome has probability 1/n, so we get n/2 divided by n, or one half.

Now that we've explored the concept of events in our two running examples, we know that the probability that the sum of two dice equals 7 is 1/6, a useful fact if you're ever playing craps. And we know that a pivot gives us a 25-75 split or better in randomized QuickSort with 50 percent probability, a useful fact if you want to develop intuition for why QuickSort is, in fact, quick.
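Here is the same computation in Python, again a sketch reusing the pivot_omega and pivot_prob from the earlier block: count the pivots whose rank falls in the middle half of the array.

    # A pivot gives a 25-75 split or better exactly when its rank lies
    # in the middle half of the array (for n = 100, ranks 26 through 75).
    good_pivots = [i for i in pivot_omega if n / 4 < i <= 3 * n / 4]

    pr_good_split = sum(pivot_prob[i] for i in good_pivots)
    print(pr_good_split)  # 0.5 for n = 100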
So that's events. Let's move on to random variables. Random variables are basically statistics measuring what happens in the random outcome. Formally, a random variable is a real-valued function defined on the sample space Ω: given an outcome, given a realization of the randomness, it gives you back a number.

Now, the random variable we most often care about in algorithm design is the running time of a randomized algorithm. That's the case, for example, with the QuickSort algorithm. Notice that this is in fact a random variable: if we know the state of the world, if we know the outcome of all of the coin flips that our code is going to make, then there's just some running time of our algorithm. So in that sense it's a random variable: given the outcomes of the coin flips, out pops a number, the running time, say in milliseconds, of the algorithm.

Here are a couple of more modest examples of random variables in our two running examples. If we're rolling two dice, one very simple random variable takes as input the outcome, the result of the two dice, and spits out their sum. That's certainly a random variable: on any given outcome it takes on some integer value between 2 at the minimum and 12 at the maximum. Our second running example is the randomly chosen pivot made by the outermost call of QuickSort. Let's think about the random variable which is the size, meaning the subarray length, passed to the first recursive call. Equivalently, this random variable is the number of elements of the input array smaller than the randomly chosen pivot. This is a random variable that takes on some integer value between 0 at the smallest, if we happen to pick the minimum of the array as the pivot, and n-1 at the largest, if we happen to pick the maximum element as the pivot.

Next, let's talk about the expectation of a random variable. This is really nothing more than the average, and of course when you take the average of some statistic, you want it weighted by the probabilities of its various values. So let's just make that precise real quick. Consider some random variable, capital X. The expectation, also called the expected value, is written with a capital E and square brackets around the random variable. In English, the expectation is just the average value, naturally weighted by the probabilities of the various possible outcomes. More mathematically, we sum over everything that could happen, with lowercase i denoting one possible outcome: we look at the value of the random variable when that outcome occurs, and we weight it by the probability of that outcome occurring.
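In symbols, writing p(i) for the probability of outcome i (my notation), the definition just described reads:

    E[X] = \sum_{i \in \Omega} X(i) \cdot p(i)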
The next two quizzes ask you to compute the expectations of the two random variables we identified on the previous slide. The first quiz is about the two dice and the random variable which is the sum of their values. What is the average value of that random variable? What is its expectation?

The answer to this question is the second option: the average value is 7. There are a bunch of different ways to see that. In my opinion, the best way to compute this is using linearity of expectation, which is the next concept we're going to cover. If you wanted to, you could just compute it by brute force, by which I mean you could iterate over all 36 possible outcomes, look at the value of the two dice in each, and evaluate the sum we had in the definition on the last slide. A slightly sneakier way to do it, if you don't know linearity of expectation, is to pair up the various outcomes: it's equally likely that the sum of the two dice is 2 or 12; it's equally likely to be 3 or 11; 4 or 10; and so on. Each way of pairing up these values of the two dice sums to 14, so when you average you get 7. But again, the right way to do this is linearity of expectation, which we'll cover next.
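The brute-force computation mentioned above is a one-liner in Python, again over the dice_omega and dice_prob from the earlier sketch:

    # Expectation by definition: the value of the random variable on each
    # outcome, weighted by that outcome's probability.
    expected_sum = sum((d1 + d2) * dice_prob[(d1, d2)] for (d1, d2) in dice_omega)
    print(expected_sum)  # 7.0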
The second quiz covers the second random variable we identified. Now we're back to QuickSort and the random pivot chosen in the outermost call, and the question is: how big, on average, in expectation, is the subarray passed to the first recursive call? Equivalently, on average, how many elements are going to be less than the randomly chosen pivot?

The correct answer to this quiz is the third option. In fact, it's exactly (n-1)/2, not n/2, but basically half of the elements. Again, there's sort of a sneaky way to see this, if you want, which is that the two recursive calls are clearly symmetric: the expected size of the left recursive call is going to be the same as the expected size of the right recursive call. The two recursive calls always comprise n-1 of the elements between them, so by symmetry you expect half in each, that is, (n-1)/2 in each. Though for this problem, I think it's perfectly fine to just compute this using the definition of expectation. So let X denote the random variable we care about, the subarray size. We can compute directly by summing over all of the possible outcomes, all of the possible choices of the pivot. With probability 1/n we choose the minimum as the pivot, resulting in zero elements being passed to the first recursive call. With probability 1/n we pick the second smallest element, resulting in one element being passed to the first recursive call. With probability 1/n we pick the third smallest, giving us a subarray size of two, and so on. Finally, with probability 1/n, we pick the maximum element, giving us a subarray size of n-1. If you compute this sum out, since 0 + 1 + ... + (n-1) = n(n-1)/2, you get, as expected, E[X] = (1/n)(0 + 1 + ... + (n-1)) = (n-1)/2.

So expectation is the last definition I'm going to give you in part one of the probability review. Next is our fifth and final concept for this video, which is linearity of expectation. That's not a definition; it's more of a theorem. So what is linearity of expectation? It's a very simple property of random variables that is super, super important; it comes up all the time when we analyze randomized algorithms, and random processes more generally. It's the following very simple claim, which I'll sometimes abbreviate as "Lin Exp" for short. Suppose you've got a bunch of random variables defined on the same sample space. Then if you want the expected value of the sum of these random variables, it doesn't matter whether you take the sum first and then the expectation, or take the expectations first and then the sum. That is, the expected value of a sum of random variables is equal to the sum of the expectations of the individual random variables.

One of the reasons linearity of expectation is so ubiquitously useful is that it always works, no matter what these random variables are; in particular, even when the random variables are not independent. Now, I haven't defined independent random variables yet; that will come in part two of the probability review. But hopefully you have an intuitive sense of what independence means: things are independent if knowing something about one of the random variables doesn't influence what you expect from the other random variable. I realize the first time you see linearity of expectation, it can be a little hard to appreciate. As far as applications go, we'll see plenty throughout this course; in pretty much every single application of probability that we'll see, the analysis will involve linearity of expectation.
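Stated in symbols, for random variables X_1, ..., X_n defined on the same sample space, the claim is:

    E\left[\sum_{j=1}^{n} X_j\right] = \sum_{j=1}^{n} E[X_j]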
But it may be hard to appreciate why this is not a tautology; just symbolically, it may look like it has to be true. To point out that there's content here: if we replaced the sums in this equation by products, the resulting statement would in general be false when the random variables are not independent. So the same thing is not true of products; it's really about sums.

Let me give you a trivial illustration of linearity of expectation, to point out how easily it lets us evaluate the expected sum of two dice. In our first running example, let's introduce two random variables, X1 and X2, for the results of the first and second die, respectively. Computing the expected value of a single die is easy: there are only six values to enumerate over, in contrast with the 36 outcomes we enumerated over when we evaluated the sum of the two dice directly. The average value of a single die, you won't be surprised to hear, is 3.5: it ranges over the integers between 1 and 6 uniformly, so 3.5 on average. And now, using linearity of expectation, the expected sum of two dice is simply double the average value of a single one: 7.

On the next slide, I'm going to prove this property, prove linearity of expectation. But frankly, the proof is pretty trivial, so if you don't care about the proof, that's fine; you can skip it without loss. I'm including it just for completeness. And I've got to say, I don't know of another mathematical statement that is simultaneously so trivial to prove and so unbelievably useful. There's really something remarkable about linearity of expectation.
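Here is a small Python sketch of both points at once, reusing the dice sample space from before; the expectation helper is my own, not anything from the lecture. Linearity holds even for thoroughly dependent variables, while the analogous statement for products fails:

    # Expectation of a random variable (a function on outcomes) with
    # respect to the uniform distribution on the two-dice sample space.
    def expectation(X):
        return sum(X(o) * dice_prob[o] for o in dice_omega)

    X1 = lambda o: o[0]  # result of the first die
    X2 = lambda o: o[1]  # result of the second die

    # Linearity: E[X1 + X2] = E[X1] + E[X2], independent or not.
    print(expectation(lambda o: X1(o) + X2(o)))  # 7.0
    print(expectation(X1) + expectation(X2))     # 7.0
    # X1 + X1 is a sum of completely dependent variables; linearity still holds.
    print(expectation(lambda o: 2 * X1(o)))      # 7.0 = 2 * 3.5

    # No analogous rule for products of dependent variables:
    print(expectation(lambda o: X1(o) * X1(o)))  # 15.166...  (E[X1^2])
    print(expectation(X1) * expectation(X1))     # 12.25      (E[X1]^2)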
So how does the proof go? Honestly, we just write out the definition of expectation, reverse the order of summation, and we're done. Let me start with the right-hand side of the equation, the sum of the expectations of the individual random variables. Applying the definition of expectation, each term is just a weighted average over the possible outcomes, so the right-hand side is a double sum. Now, instead of summing first over the random variable j and then over the realized outcome i, I'm going to do it in the reverse order: sum first over the outcome i and then over the random variable j. The probability of outcome i doesn't depend on j, so we can yank p(i) outside the inner sum. But now what have we got? Inside the parentheses we simply have the value of the sum of the X_j's on the outcome i, and we're averaging that sum with respect to the probabilities, the p(i)'s. So this is just the definition of the expectation of the sum of the random variables. That's it: linearity of expectation is really just a reversal of double sums.
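Written out, with p(i) again denoting the probability of outcome i, the whole proof is the chain below; the swap of the two sums in the second step is the entire content:

    \sum_{j=1}^{n} E[X_j]
        = \sum_{j=1}^{n} \sum_{i \in \Omega} X_j(i)\, p(i)
        = \sum_{i \in \Omega} \sum_{j=1}^{n} X_j(i)\, p(i)
        = \sum_{i \in \Omega} p(i) \sum_{j=1}^{n} X_j(i)
        = E\left[\sum_{j=1}^{n} X_j\right]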
Now, for those of you who are rusty on these kinds of manipulations, I want to point out that this reversal of the double sum is nothing complicated at all. If you want a really pedestrian way to think about what's happening, just imagine we take these summands and write them out in a grid, where the columns are indexed by the random variable j and the rows are indexed by the outcome i, and in a given cell of the grid we write the term X_j(i) times p(i). The point is that both of the double sums, the first one and the second one, are just the sum of all of the entries in this grid. In the first sum, we group the entries by column, take column sums, and then add up the column sums; in the second sum, we group by row, take row sums, and then add those up. Either way, it's just the sum of everything in the grid, so it doesn't matter which way you sum. So the reversal of double sums is a trivial thing; if you get lost in the notation, just interpret each double sum in terms of this grid. There are no tricks up my sleeve when I reverse these sums; it's a totally elementary step. So, again: linearity of expectation. Trivial to prove, incredibly useful; don't forget it.

I want to conclude this video with one final example, in order to tie together all of the concepts we just reviewed. It's going to be an example of load balancing: assigning processes to servers. This will in fact be quite important for the analysis of hashing that we're going to see toward the end of the course as well. But for now, let's just think about the following simple problem. For some integer n, you have n computer processes which have to be assigned to n servers in some way. Now, you're feeling really lazy, so you're just going to take each of the processes and assign it to a totally random server, with each server equally likely to get a given process. The question I want to study is: does this laziness cost you, at least on average? If you look at a server, what's its expected load?

Let's proceed to the solution. Before we start talking about expectations, we have to be clear about the sample space and the probabilities of the various outcomes. Remember, the sample space Ω denotes every possible thing that can happen. What are we doing? For each process, we're assigning it to a random server, so all of the things that can happen are all of the different assignments of these n processes to these n servers. If you think about it, there are n^n possible outcomes, because you have n choices for each of the n processes. Moreover, because each process is assigned to one of the servers uniformly at random, each of these n^n assignments is equally likely, with probability 1/n^n.

Now that we have a sample space, we're in a position to define a random variable, and we already know which random variable we care about: the load of a server. All of the servers are exactly the same, so we can just focus on one of them, say the first server, and let Y be the number of processes assigned to it. Going back to the problem statement, what we're asked to compute is the expected value of Y, the expected number of processes assigned to a server.

Now, in principle, we could go to the definition of expectation and compute by brute force, summing over all possible outcomes the value of Y and taking the weighted average. Unfortunately, there are n^n different outcomes, and that's a lot. So what can we do other than this brute-force computation? Well, recall our example of linearity of expectation and the sum of two dice. We observed that instead of computing the expectation by enumerating all 36 outcomes, it was much better to focus on a single die, compute its expectation, and then conclude with linearity of expectation. We'll do the same thing here: instead of focusing on the sum Y, we'll focus on constituent parts of Y,
namely, whether or not each single process gets assigned to the first server, and then we'll combine them with linearity of expectation. More precisely, for a given process j, let's define X_j to be 1 if the jth process gets assigned to the first server, and 0 otherwise. Zero-one random variables like X_j are often called indicator random variables, because they in effect indicate whether or not a certain event occurs; in this case, whether or not the jth process gets assigned to the first server.

Why did I make this definition? Well, observe that the total number of processes assigned to the first server is simply the sum, from j = 1 to n, of the X_j's. X_j says whether or not a given process, the jth process, is on the first server; the total number is the sum of these over all j. The benefit of this maneuver is that we only need to compute the expectation of an extremely simple indicator random variable X_j. This is like the win we got when summing up two dice: instead of having to compute the expected value of the sum, we just had to focus on the expectation of a single die, which was really easy. Similarly, here the expectation of a single X_j is really easy.

Specifically, let's write it out using the definition of expectation. What is the expected value of X_j? Well, let's group together all the outcomes in which it takes on the value 0; the contribution to the expectation is zero for all of those outcomes. Then there are the rest of the outcomes, where X_j takes on the value 1, and in those cases it contributes 1 to the expectation. So we get some happy cancellation from the zero part, and all we have to worry about is the probability that X_j takes on the value 1. And what was X_j again? How did we define it? It's 1 exactly when the jth process gets assigned to the first server. How are processes assigned? Remember, the proposed solution assigns each process to each of the n servers with equal, uniform probability. So the probability that the jth process gets assigned to the first server is 1/n, and that's the expected value of X_j.

This leaves us with just the sum, from j = 1 to n, of 1/n; that is, we sum 1/n with itself n times, which of course equals 1. So, summarizing: the expected number of processes assigned to a given server is exactly one.
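As a final sketch, here is the calculation mirrored in Python, both exactly via the indicator variables and with a quick simulation as a sanity check; the server count, trial count, and random seed are arbitrary choices of mine:

    import random

    n = 8  # n processes, n servers (small enough to check by simulation)

    # Exact: E[Y] = sum over j of E[X_j] = n * (1/n) = 1, by linearity.
    expected_load = sum(1 / n for _ in range(n))
    print(expected_load)  # 1.0

    # Simulation: assign each process to a uniformly random server and
    # record the load of server 0; the average should approach 1.
    random.seed(0)
    trials = 100_000
    total = sum(
        sum(1 for _ in range(n) if random.randrange(n) == 0)
        for _ in range(trials)
    )
    print(total / trials)  # close to 1.0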
So really, on average, we don't lose much for our laziness in just randomly spraying the processes across the servers: at least if we only care about averages, any given server ends up with just one process on it. This is characteristic of the role that randomization plays in algorithm design, and in computer science more generally. Often, just by making random choices, we can get away with really simple heuristics and do surprisingly well with very little work. QuickSort, of course, is one example of that: just by choosing a random pivot in every recursive call, we get an extremely practical, widely used sorting algorithm with an average running time of O(n log n).