Having completed our first analysis of an algorithm, namely an upper bound on the running time of the merge sort algorithm, what I want to do next is take a step back and be explicit about three assumptions, three biases, that we made when we did this analysis of merge sort and interpreted the results. These three assumptions we will adopt as guiding principles for how to reason about algorithms, and for how to define a so-called fast algorithm, for the rest of the course.

So the first guiding principle is that we used what's often called worst-case analysis. By worst-case analysis, I simply mean that our upper bound of 6n log₂ n + 6n applies to the number of lines of code executed for every single input array of length n. We made absolutely no assumptions about the input, where it comes from, or what it looks like, beyond its length n. Put differently, if, hypothetically, we had some adversary whose sole purpose in life was to concoct some malevolent input designed to make our algorithm run as slowly as possible, the worst this adversary could do is still upper bounded by this same number, 6n log₂ n + 6n.

Now, this sort of worst-case guarantee popped out so naturally from our analysis of merge sort that you might well be wondering: what else could you do? Well, two other methods of analysis, which do have their place although we won't really discuss them in this course, are quote-unquote average-case analysis, and the use of a set of prespecified benchmarks. By average-case analysis, I mean you analyze the average running time of an algorithm under some assumption about the relative frequencies of different inputs. For example, in the sorting problem, one thing you could do, although it's not what we do here, is assume that every possible input array is equally likely, and then analyze the average running time of the algorithm. By benchmarks, I just mean that one agrees up front on some set of, say, ten or twenty benchmark inputs, which are thought to represent practical or typical inputs for the algorithm.

Now, both average-case analysis and benchmarks are useful in certain settings, but for them to make sense, you really have to have domain knowledge about your problem. You need some understanding of which inputs are more common than others, of which inputs better represent typical inputs. By contrast, in worst-case analysis, by definition you're making absolutely no assumptions about where the input comes from.
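To make that concrete, here is a small Python sketch, not from the lecture, that counts the comparisons a straightforward merge sort performs on several inputs of the same length. Comparisons are only a crude stand-in for "lines of code executed", but the point survives: the counts vary a little from input to input, yet every one of them sits below the 6n log₂ n + 6n bound, with no assumptions about where the input came from.

```python
import math
import random

def merge_sort_count(a):
    """Return (sorted copy of a, number of comparisons performed)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, c_left = merge_sort_count(a[:mid])
    right, c_right = merge_sort_count(a[mid:])
    out, i, j, c = [], 0, 0, 0
    while i < len(left) and j < len(right):
        c += 1                       # one comparison per loop iteration
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out, c_left + c_right + c

n = 1000
bound = 6 * n * math.log2(n) + 6 * n    # the worst-case bound from the lecture
for name, arr in [("already sorted", list(range(n))),
                  ("reverse sorted", list(range(n, 0, -1))),
                  ("random order", random.sample(range(n), n))]:
    _, comparisons = merge_sort_count(arr)
    print(name, comparisons, comparisons <= bound)   # the last field is always True
```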
Because it makes no assumptions about the input, worst-case analysis is particularly appropriate for general-purpose subroutines, subroutines that you design without having any knowledge of how they will be used or what kinds of inputs they will be used on. And happily, another bonus of doing worst-case analysis, as we will in this course, is that it's usually mathematically much more tractable than trying to analyze the average performance of an algorithm under some distribution over inputs, or to understand the detailed behavior of an algorithm on a particular set of benchmark inputs. This mathematical tractability was reflected in our merge sort analysis, where we had no a priori goal of analyzing the worst case per se; it's just what naturally popped out of our reasoning about the algorithm's running time.

The second and third guiding principles are closely related. The second one is that, in this course, when we analyze algorithms, we won't worry unduly about small constant factors or lower-order terms. We saw this philosophy at work very early on, in our analysis of merge sort, when we discussed the number of lines of code that the merge subroutine requires. We first upper bounded it by 4m + 2 for an array of length m, and then we said, eh, let's just think about it as 6m instead; let's have a simpler, sloppier upper bound and work with that. So that was already an example of not worrying about small changes in the constant factor.

Now, the question you should be wondering about is: why do we do this, and can we really get away with it? So let me tell you about the justifications for this guiding principle. The first motivation is clear, and we used it already in our merge sort analysis: it's simply way easier mathematically if we don't have to precisely pin down what the constant factors and lower-order terms are. The second justification is a little less obvious, but it's extremely important. I claim that, given the level at which we're describing and analyzing algorithms in this course, it would be totally inappropriate to obsess unduly over exactly what the constant factors are. Recall our discussion of the merge subroutine. We wrote that subroutine down in pseudocode, and we gave an analysis of 4m + 2 for the number of lines of code executed, given an input of length m. We also noted that it was somewhat ambiguous exactly how many lines of code we should count it as, depending on how you count loop increments and so on.
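As a reminder of where that 4m + 2 came from, here is a minimal Python sketch of the merge step, assuming the same two-pointer scheme as the earlier pseudocode; the comments mark the statements you might or might not include in the count, which is exactly why the constant isn't pinned down.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list (total length m)."""
    result = []
    i, j = 0, 0                      # a couple of initializations: the "+ 2"
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:      # one comparison per output element...
            result.append(left[i])   # ...one copy...
            i += 1                   # ...and one pointer increment
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])          # copy over whichever half has leftovers
    result.extend(right[j:])
    return result
```

Whether you also charge for the loop test, the list bookkeeping, or the final copies is a judgment call; any reasonable convention lands at a small constant times m, which is all that 4m + 2, or the sloppier 6m, was ever meant to capture.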
So even there, small constant factors could creep in, given the underspecification of the pseudocode. Depending on how that pseudocode gets translated into an actual programming language like C or Java, you'll see the number of lines of code deviate even further, not by a lot, but again by small constant factors. And when such a program is then compiled down into machine code, you'll see even greater variance, depending on the exact processor, the compiler, the compiler optimizations, the particular implementation, and so on. So, to summarize: because we're going to describe algorithms at a level that transcends any particular programming language, it would be inappropriate to specify precise constants. The precise constants are ultimately determined by more machine-dependent aspects, like who the programmer is, what the compiler is, what the processor is, and so on.

And the third justification is, frankly, that we're just going to be able to get away with it. That is, one might be concerned that ignoring things like small constant factors could lead us astray, that we'd wind up deriving results which suggest that an algorithm is fast when it's really slow in practice, or vice versa. But for the problems we discuss in this course, we'll get extremely accurate predictive power even though we won't be keeping track of lower-order terms and constant factors. When the mathematical analysis we do suggests that an algorithm is fast, indeed it will be; when it suggests that an algorithm is not fast, indeed that will be the case. So we lose a little bit of granularity of information, but we don't lose what we really care about, which is accurate guidance about which algorithms are going to be faster than others.

The first two justifications, I think, are pretty self-evident. This third justification is more of an assertion, but it's one we'll be backing up over and over again as we proceed through this course. Now, don't get me wrong: I'm not saying constant factors aren't important in practice. Obviously, for crucial programs the constant factors are hugely important. If you're writing the sort of crucial inner loop that your startup's survival depends on, by all means optimize the constants like crazy. The point is just that understanding tiny constant factors in the analysis is an inappropriate level of granularity for the kind of algorithm analysis we're going to be doing in this course. Okay, let's move on to the third and final guiding principle.
So the third principle is that we're going to use what's called asymptotic analysis, by which I mean we will focus on large input sizes, on the performance of an algorithm as the size n of the input grows large, that is, tends to infinity. Now, this focus on large input sizes was already evident when we interpreted our bound on merge sort. How did we describe that bound? We said, oh well, it needs a number of operations proportional to, a constant factor times, n log n. And we very cavalierly declared that this was better than any algorithm whose running time depends quadratically on the input size. So, for example, we argued that merge sort is a better, faster algorithm than something like insertion sort, without actually discussing the constant factors at all.

So, mathematically, we were saying that the running time of merge sort, which we can represent as the function 6n log₂ n + 6n, is better than any function with a quadratic dependence on n, even one with a small constant, like, say, (1/2)n², which might roughly be the running time of insertion sort. And this is a mathematical statement that is true if and only if n is sufficiently large. Once n grows large, it is certainly true that the expression on the left is smaller than the expression on the right, but for small n the expression on the right is actually going to be smaller, because of its smaller leading constant. So in saying that merge sort is superior to insertion sort, the bias is that we're focusing on problems with a large n.
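Here is the quick calculation behind that claim, which the lecture reads off a graph rather than writing out. Dividing both sides by n,

$$
6n\log_2 n + 6n \;\le\; \tfrac{1}{2}n^2
\quad\Longleftrightarrow\quad
6\log_2 n + 6 \;\le\; \tfrac{n}{2}
\quad\Longleftrightarrow\quad
n \;\ge\; 12\log_2 n + 12,
$$

and the last condition first holds at roughly n = 90 (at n = 89 we have 12 log₂ 89 + 12 ≈ 89.7 > 89, while at n = 90 we get about 89.9 ≤ 90). Below that point the quadratic bound is actually the smaller one; above it, the merge sort bound wins by an ever-growing margin.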
So the question you should have is: is that reasonable? Is it a justified assumption to focus on large input sizes? And the answer is certainly yes. The reason we focus on large input sizes is that, frankly, those are the only problems which are at all interesting. If all you need to do is sort 100 numbers, use whatever method you want, and it's going to happen instantaneously on modern computers. You don't need to know, say, the divide-and-conquer paradigm if all you need to do is sort 100 numbers.

Now, one thing you might be wondering is whether, with computers getting faster all the time according to Moore's Law, it even matters to think about algorithmic analysis, if eventually all problem sizes will just be trivially solvable on super-fast computers. But, in fact, the opposite is true. Moore's Law, with computers getting faster, actually means that our computational ambitions will naturally grow. We naturally focus on ever-larger problem sizes, and the gulf between an n² algorithm and an n log n algorithm will become ever wider.

A different way to think about it is in terms of how much bigger a problem you can solve as computers get faster. If you're using an algorithm with a running time proportional to the input size, then when computers get faster by a factor of four, you can solve problems that are a factor of four larger. Whereas if you're using an algorithm whose running time is proportional to the square of the input size, then when a computer gets faster by a factor of four, you can only solve a problem twice as large. And we'll see even starker examples of this gulf between different algorithmic approaches as the course goes on.

So, to drive this point home, let me show you a couple of graphs. What we're looking at here is a graph of two functions. The solid curve is the upper bound that we proved for merge sort, 6n log₂ n + 6n. And the dotted curve is an estimate, a rather generous estimate, of the running time of insertion sort, namely (1/2)n². And we see in the graph exactly the behavior we discussed earlier: for small n, down here, because (1/2)n² has the smaller leading constant, it's actually the smaller function, and this is true up to a crossing point of maybe 90 or so. Beyond n = 90, the quadratic growth of the n² term overwhelms the fact that it had a smaller constant, and it starts being bigger than the other function, 6n log₂ n + 6n. So in the regime below 90, this predicts that insertion sort will be faster, and in the regime above 90, it predicts that merge sort will be faster.

Now, here's what's interesting. Let's rescale the x-axis and look well beyond this crossing point of 90; let's increase it by an order of magnitude, up to arrays of size 1500. And I want to emphasize that these are still very small problem sizes: if all you need to do is sort arrays of size 1500, you really don't need to know divide-and-conquer or anything else we'll talk about; that's a pretty trivial problem on modern computers. So what we're seeing is that, even for very modest problem sizes like an array of size 1500, the quadratic dependence in the insertion sort bound more than dwarfs the fact that it had a smaller constant factor.
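If you'd rather see the numbers behind those two pictures, here is a small Python sketch, not part of the lecture, that evaluates both bounds on each side of the crossover and then replays the "four times faster computer" thought experiment by treating a faster machine as a larger budget of operations.

```python
import math

def merge_sort_bound(n):
    """The upper bound proved for merge sort: 6 n log2(n) + 6 n."""
    return 6 * n * math.log2(n) + 6 * n

def insertion_sort_estimate(n):
    """The (rather generous) quadratic estimate for insertion sort: n^2 / 2."""
    return 0.5 * n ** 2

# The two regimes from the graphs.
for n in (50, 90, 1500):
    print(n, round(merge_sort_bound(n)), round(insertion_sort_estimate(n)))
# n = 50:   roughly 1,993 vs 1,250       -> the quadratic estimate is smaller
# n = 90:   roughly 4,046 vs 4,050       -> the crossing point
# n = 1500: roughly 104,000 vs 1,125,000 -> the n log n bound is ~10x smaller

# A computer that is 4x faster is, in effect, a 4x larger budget of operations.
def largest_n(cost, budget):
    """Largest n whose cost still fits within the budget (simple linear scan)."""
    n = 1
    while cost(n + 1) <= budget:
        n += 1
    return n

def linear_bound(n):
    return 6 * n    # a hypothetical linear-time bound, for contrast

for budget in (10**6, 4 * 10**6):
    print(budget, largest_n(linear_bound, budget), largest_n(insertion_sort_estimate, budget))
# Quadrupling the budget quadruples the solvable size under the linear bound
# (166,666 -> 666,666) but only doubles it under the quadratic one (1,414 -> 2,828).
```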
So in this large regime, the gulf between the two algorithms is growing. And of course, if I increased the scale by another 10x, or 100x, or 1000x, to get to genuinely interesting problem sizes, the gap between these two algorithms would be even bigger; it would be huge.

That said, I'm not saying you should be completely ignorant of constant factors when you implement algorithms. It's still good to have a general sense of what those constant factors are. For example, in the highly tuned versions of merge sort that you'll find in many programming libraries, precisely because of the difference in constant factors, the algorithm will actually switch from merge sort over to insertion sort once the problem size drops below some particular threshold, say seven elements or so. So for small problem sizes, you use the algorithm with the smaller constant factors, insertion sort, and for larger problem sizes you use the algorithm with the better rate of growth, namely merge sort.
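Here is a sketch of that hybrid idea, reusing the merge function from the earlier sketch; the cutoff of seven is just the lecture's "say, seven elements", and real library sorts, which are considerably more tuned than this, pick their own thresholds.

```python
def insertion_sort(a):
    """In-place insertion sort: small constant factors, but quadratic growth."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def hybrid_merge_sort(a, cutoff=7):
    """Merge sort that hands small subproblems to insertion sort."""
    # assumes the merge() sketch defined earlier in these notes
    if len(a) <= cutoff:
        return insertion_sort(list(a))
    mid = len(a) // 2
    return merge(hybrid_merge_sort(a[:mid]), hybrid_merge_sort(a[mid:]))
```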
So, to review: our first guiding principle is that we're going to pursue worst-case analysis. We're going to look at bounds on the performance, on the running time, of an algorithm that make no domain assumptions, no assumptions about which input of a given length the algorithm is provided. The second guiding principle is that we're not going to focus on constant factors or lower-order terms; that would be inappropriate, given the level of granularity at which we're describing algorithms. And the third is that we're going to focus on the rate of growth of algorithms for large problem sizes. Putting these three principles together, we get a mathematical definition of a fast algorithm: namely, we're going to pursue algorithms whose worst-case running time grows slowly as a function of the input size.

So let me tell you how you should interpret what I just wrote down in this box. On the left-hand side is clearly what we want: we want algorithms which run quickly if we implement them. And on the right-hand side is a proposed mathematical surrogate for a fast algorithm. The left-hand side is not a mathematical definition; the right-hand side is, as will become clear in the next set of lectures. So we're identifying fast algorithms with those that have good asymptotic running time, running time which grows slowly with the input size.

Now, what would we want from a mathematical definition? We'd want a sweet spot. On the one hand, we want something we can actually reason about; this is why we zoom out and squint and ignore things like constant factors and lower-order terms. We can't keep track of everything; otherwise we'd never be able to analyze anything. On the other hand, we don't want to throw the baby out with the bathwater; we want to retain predictive power. And this definition, for the problems we're going to talk about in this course, turns out to be the sweet spot for reasoning about algorithms: worst-case analysis using the asymptotic running time. We'll be able to prove lots of theorems and establish a lot of performance guarantees for fundamental algorithms, but at the same time we'll have good predictive power: what the theory advocates will in fact be the algorithms that are known to be best in practice.

So the final explanation I owe you is: what do I mean by the running time growing slowly with respect to the input size? Well, the answer depends a little bit on the context, but for almost all of the problems we're going to discuss, the holy grail will be what's called a linear-time algorithm, an algorithm whose number of instructions grows proportionally to the input size. We won't always be able to achieve linear time, but that's, in some sense, the best-case scenario. Notice that linear time is even better than what we achieved with our merge sort algorithm: merge sort runs in slightly superlinear time, n times log n in the input size. If possible, we'd like to be linear time; it's not always going to be possible, but that is what we will aspire to for most of the problems we discuss in this course.

Looking ahead, the next series of videos is going to have two goals. First of all, on the analysis side, I'll describe formally what I mean by asymptotic running time. I'll introduce "big-Oh" notation and its variants, explain its mathematical definitions, and give a number of examples. On the design side, we'll get more experience applying the divide-and-conquer paradigm to further problems. See you then.