This sequence of videos is about the design and analysis of heuristics: algorithms which are generally guaranteed to be fast, but not guaranteed to be 100% correct. As a case study, we're going to revisit the knapsack problem that we thought about earlier, in the dynamic programming section.

Let's briefly review the three strategies for dealing with NP-complete problems; all of them are relevant for the knapsack problem. The first strategy is to identify computationally tractable special cases of the NP-complete problem that you're grappling with. If you're lucky, the instances relevant for your application will fall squarely into one of these computationally tractable special cases. Even otherwise, it is useful to have these kinds of subroutines lying around; perhaps with a little bit of extra work you can reduce your problem instances to one of these computationally tractable special cases. The dynamic programming solution to the knapsack problem identifies one such special case, namely knapsack instances where the knapsack capacity is small, for example comparable to the number of items.

The second approach is the design and analysis of heuristics: algorithms which are not guaranteed to be 100% optimal. There will be more to say about those in a second.

The third strategy is to insist on a correct algorithm. This being an NP-complete problem, you're then expecting to see exponential running time, but you can still insist on a running time better than what you'd achieve with naive brute-force search. The dynamic programming solution to the knapsack problem can be viewed as a contribution in this direction. Naive brute-force search, enumerating over all possible subsets of items, takes time proportional to 2^n, where n is the number of items. The dynamic programming algorithm takes time proportional to n, the number of items, times W, the knapsack capacity. For most instances of interest, the dynamic programming solution, albeit exponential for reasons we've discussed in the past, will be a significant improvement over naive brute-force search.

Let's now return to the subject of heuristics, the topic for today. Happily, despite being NP-complete, the knapsack problem admits some quite useful heuristics. We're going to study two of them.
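To make the n-times-W running time concrete, here is a minimal Python sketch of that dynamic program. It assumes item sizes and the capacity are non-negative integers (the regime where the O(nW) bound is meaningful); the function name and table layout are illustrative, not code from the lecture.

```python
def knapsack_dp(values, sizes, capacity):
    """O(n * W) dynamic program for knapsack (illustrative sketch).

    A[i][w] = maximum total value achievable using a subset of the
    first i items whose total size is at most w.
    Assumes sizes and capacity are non-negative integers.
    """
    n = len(values)
    # Base case i = 0 (no items available): value 0 for every capacity.
    A = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, s = values[i - 1], sizes[i - 1]
        for w in range(capacity + 1):
            A[i][w] = A[i - 1][w]          # case 1: skip item i
            if s <= w:                     # case 2: item i fits, maybe take it
                A[i][w] = max(A[i][w], A[i - 1][w - s] + v)
    return A[n][capacity]
```

Compare this to brute force: the table has (n + 1)(W + 1) entries, each filled in constant time, versus the 2^n subsets a naive enumeration would examine.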
First, we'll use the greedy algorithm design paradigm to come up with a pretty good heuristic. Pleasingly, this pretty good greedy heuristic is also blazingly fast. We'll then pull out a different tool, namely dynamic programming, to develop yet another heuristic. It's going to be polynomial time, not as fast as the greedy solution, but its accuracy will be excellent: it will get within a (1 - epsilon) factor of optimal, where epsilon is a parameter that we can tune as we wish.

Now, there are zillions and zillions of algorithms in the world, and anybody can write one of them down. So if someone proposes an algorithm to you, the onus is on them to convince you that it's a good idea, that it's, in some sense, a good algorithm. Periodically in these courses, I've been making that argument in two ways. First, I provide you with an argument that the algorithm I propose is correct: it is always guaranteed to solve the problem that we're interested in. Second, I argue that it solves the problem using a reasonable amount of resources. A great exhibit A would be, say, Dijkstra's algorithm. We wanted to solve the shortest-path problem with a single source and non-negative edge lengths. I gave you a proof that Dijkstra's algorithm always achieves that goal, and I showed you an implementation whose running time was not much more than what it takes to just read the input.

Now, of course, this best-case scenario is not going to be realized for NP-complete problems, and in particular for the knapsack problem. You have to give up either on the correctness or on the running time; you have to make a compromise. A tricky question, then, is how do you argue that the compromise you're making is not overly severe? How do you argue that your approach to an NP-complete problem is a good one? Heuristics relax correctness, and so one would like an argument that says correctness is not relaxed by too much. In the best-case scenario, when you propose a heuristic, you should supply a performance guarantee which says that the solution is almost correct, in some sense, on every single instance, or, failing that, at least on many instances of interest.
Now, in practice you do bump into problems that are so hard you wind up having to resort to heuristics where you really don't have a good understanding of how or when they work, where you really just cross your fingers, run the heuristic, and hope for the best. Local search is probably the preeminent example of this type of heuristic, something like the k-means algorithm for clustering. But for the problem we're going to talk about today, we're going to seek out heuristics that have worst-case performance guarantees. And of course, a broader theme of this discussion, and the reason I'm using knapsack as an example, is that this will show us how our algorithmic toolbox is now rich enough to design not just lots of different exact algorithms, but also lots of useful heuristics for NP-complete problems.

So let me briefly remind you what the knapsack problem is. As input, we are given 2n + 1 numbers. There are n items, each with a positive value and a positive size. The final number is the knapsack capacity, capital W. The goal is to pack the knapsack with the most valuable set of items. That is, you're seeking a subset S of items with the property that the sum of the sizes of the items in S is bounded above by the knapsack capacity W; subject to that constraint, your chosen subset should maximize the sum of the values. This is a fundamental problem that comes up all the time, in particular as a subroutine. Really, whenever you have a constrained resource that you want to use in the best way possible, that is basically the knapsack problem.

So let's turn to heuristics for the knapsack problem. Now that I've told you we're happy with heuristics, that we're willing to relax correctness, this all of a sudden resuscitates the greedy algorithm design paradigm as a feasible approach to the knapsack problem. Because the knapsack problem is NP-complete, we certainly are not expecting to find an exactly correct greedy algorithm, but maybe there's a greedy algorithm which is pretty good, and we're expecting most greedy algorithms to run extremely quickly. So let's talk through a potential greedy approach to the knapsack problem. Probably the first idea you'd have would be to consider the items in some order,
and when you consider an item, you make an irrevocable decision at that point about whether or not to include it in your solution. The question, then, is in what order should you look at the items? Well, what's our objective? Our objective is to maximize the value of our set. So obviously high-value items are important, and maybe the first idea would be to sort the items from highest value to lowest value. But if you think about that proposal for, say, 30 seconds, you quickly realize that it's a little naive; it's not the whole story. A high-value item that fills up the whole knapsack seems not quite as useful as an almost-as-high-value item with size close to zero, one that doesn't use up any of the knapsack at all. Remember that each item has two properties, not just its value but also its size, and both of these are important. We want to prefer items that have a lot of value, but we also want to prefer items that are small.

So I hope many of you are now sensing a strong parallel between the present discussion and the discussion we had in our first serious study of greedy algorithms: scheduling jobs on a single machine to minimize the sum of the weighted completion times. In that old scheduling problem, just like in this one, each object had two parameters; back then it was a job length and a job weight, or a job priority. You had a preference for higher-weight jobs, you wanted them to be first, and you had a preference for shorter jobs, you wanted them to be first. Here again we have items with two parameters, the value and the size: high value is good, low size is good. Back in the scheduling problem, the key idea was to look at the jobs' ratios and schedule in that order. By analogy, here in the knapsack problem, if we want to take the two parameters of each item and form a single parameter by which we can sort the items, the natural first cut is a ratio. Since we prefer high values and low sizes, the sensible thing to look at is the ratio of an item's value divided by its size, and we then consider items from the highest values of these ratios down to the lowest.
One way to think of this is that we consider the items in order of non-increasing bang per buck: we're always trying to get the most value for our money, where money is the amount of the knapsack capacity that we're spending. So now that we have our greedy ordering, we just proceed through the items one at a time, packing them into the knapsack in this order. Now, something happens here that didn't trouble us in the scheduling problem: at some point we might no longer be able to pack items into the knapsack. The thing might get full. Once that happens, once we reach an item which doesn't fit into the knapsack given the items we've already packed into it, we just stop the greedy algorithm.

If you think about it, the way I've written this algorithm, aborting as soon as you fail to pack one item, is not really what you would do in practice. You would do something a little bit smarter. If you fail to pack some given item, say item k+1, well, maybe item k+2 has a worse ratio but is much smaller; maybe item k+2 does fit in the knapsack, and then there's no reason not to just take it. So in practice you'd go through the entire list of items and pack whatever fits, skipping whatever doesn't. For simplicity, I'm going to analyze the slightly worse heuristic in the slides to follow.

So, just a real quick example. Consider the following three-item instance, where I've taken the liberty of sorting the items by ratio. The first item has value 2 and size 1, for a ratio of 2. The second item has a ratio of 4/3: value 4, size 3. And the third item has a ratio of 1, with value and size both equal to 3. Let's say the knapsack capacity is 5. So what does the greedy solution do? It first packs in item number 1, leaving a residual knapsack capacity of 4. So there's room for item number 2, and it packs that. Now there's a residual knapsack capacity of only 1, so there isn't any room for item number 3, and it halts. The greedy algorithm just outputs the set {1, 2}. In this particular example, you will notice that this greedy solution, of value 6, is in fact the optimal solution; you can't do better.
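Here is a minimal Python sketch of this greedy heuristic as just described: sort by value-to-size ratio, then pack until the first item that doesn't fit. The function name is illustrative, not from the lecture; a comment notes the smarter in-practice variant that skips non-fitting items instead of stopping.

```python
def greedy_knapsack(values, sizes, capacity):
    """Two-step greedy heuristic: sort items by value/size ratio
    (non-increasing), then pack greedily, aborting at the first item
    that doesn't fit. In practice you'd `continue` instead of `break`,
    skipping items that don't fit and trying the remaining ones.
    """
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / sizes[i], reverse=True)
    chosen, remaining = [], capacity
    for i in order:
        if sizes[i] > remaining:
            break  # the analyzed (slightly worse) variant stops here
        chosen.append(i)
        remaining -= sizes[i]
    return chosen

# The three-item example above (0-indexed): capacity 5, greedy packs
# items 0 and 1 for total value 6, which happens to be optimal here.
print(greedy_knapsack([2, 4, 3], [1, 3, 3], 5))  # [0, 1]
```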
But let's look at a second example, with just two items. In this particular instance, what is the value of the output of the greedy algorithm, and what is the value of an optimal solution?

So the correct answer is A. Since the knapsack capacity is 1,000 and the sum of the sizes of the two items is 1,001, there's no room for both of them. The greedy algorithm, unfortunately, because the first, tiny item has the larger ratio, will pack item number 1, and that leaves no room for item number 2. So the value of the greedy solution is just 2, whereas the optimal solution is, of course, to just take the second item. Yes, its ratio is worse, but on the other hand it fills up the whole knapsack, and so overall your value is 1,000, which obviously blows away the greedy solution. This quiz shows that, at least for general instances, the greedy knapsack heuristic we've devised so far can be arbitrarily bad; it can be terrible with respect to an optimal solution.

There is, however, a simple fix to address this issue. We're going to add a very simple step three to our greedy heuristic. The new step three is just going to compare two candidate solutions and return whichever one is better, whichever one has the bigger total value. The first candidate is just what we were doing before: the output of step two of the algorithm. The second candidate is whichever item, by itself, has the maximum value. In other words, this new greedy algorithm just runs the old one, but then does a final sanity check: it looks at each item individually and says, well, if this item just by itself dominates the solution I've computed thus far, I return this single item by itself instead.

There's no question that adding this new step three fixes the problem in the quiz on the last slide. If you add this step three, then we will choose the item that has value 1,000 instead of the lousy solution of value 2 that we computed in step two. It may seem that this step three is just kind of a hack, meant to address that specific instance. But remarkably, just adding step three transforms this greedy heuristic into one that has a reasonably good worst-case performance guarantee. Precisely, the value of the solution computed by this three-step greedy algorithm is always at least 50% of the value of an optimal solution. The jargon for this kind of algorithm is a 1/2-approximation algorithm: for every instance, it's guaranteed to compute a solution with total value at least 1/2 that of an optimal solution.
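Here is a minimal sketch of the full three-step heuristic, reusing the greedy_knapsack function sketched earlier; as before, the names are my own illustration rather than code from the lecture.

```python
def greedy_knapsack_3step(values, sizes, capacity):
    """Three-step greedy heuristic: run the greedy packing (steps 1-2),
    then return the better of that packing and the single most valuable
    item that fits on its own (step 3). This is the 1/2-approximation
    discussed above.
    """
    greedy = greedy_knapsack(values, sizes, capacity)
    greedy_value = sum(values[i] for i in greedy)
    # Step 3: the most valuable single item that fits by itself.
    fitting = [i for i in range(len(values)) if sizes[i] <= capacity]
    best_single = max(fitting, key=lambda i: values[i], default=None)
    if best_single is not None and values[best_single] > greedy_value:
        return [best_single]
    return greedy

# The quiz instance: item values (2, 1000), sizes (1, 1000), capacity
# 1000. Steps 1-2 alone return value 2; step 3 corrects this to the
# single item of value 1,000.
print(greedy_knapsack_3step([2, 1000], [1, 1000], 1000))  # [1]
```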
Also, as you would hope from a greedy heuristic, it runs extremely quickly; basically all we do is sort the items, so the running time is O(n log n). In the next video, we'll provide a proof of this theorem. And as we'll see, under some additional assumptions about the type of instances, the performance guarantee will in fact be much, much better than 50%.