1
00:00:00,000 --> 00:00:06,785
.
So Knuth-Morris-Pratt is a linear time

2
00:00:06,785 --> 00:00:11,721
algorithm, we can't, surely we can't do
better than that, can we?

3
00:00:11,721 --> 00:00:03,586
Well, it's not the end of the story.
Now we're going look at the Boyer-Moore

4
00:00:17,692 --> 00:00:23,504
algorithm. And it's also a very simple
idea that's extremely effective in

5
00:00:23,504 --> 00:00:25,734
practice.
So here's the idea.

6
00:00:25,734 --> 00:00:32,762
Instead of matching the pattern against
the text moving from left to right in both

7
00:00:32,762 --> 00:00:37,376
pattern and text.
Boyer-moore said, what, what if we try to

8
00:00:37,607 --> 00:00:46,080
scan the characters in the pattern from
right to left?" so, then, when we find a

9
00:00:46,080 --> 00:00:50,458
mismatch we can skip, we know all the
characters in the pattern.

10
00:00:50,458 --> 00:00:55,393
So, for example, if we're looking for a
needle in the, in a haystack, if we first

11
00:00:55,393 --> 00:01:01,092
look at the last character in the pattern,
which is an E, and we find that N in the

12
00:01:01,092 --> 00:01:04,715
text.
Now then, it's clear that the next

13
00:01:04,715 --> 00:01:11,146
possible match is when the Ns lined up,
Cuz if it was any other lineup, it'd be

14
00:01:11,146 --> 00:01:16,630
all these characters that we know are not
N, compared against N, so we might as well

15
00:01:16,630 --> 00:01:21,784
just move it so the Ns are lined up.
And that's good but that's not as good as

16
00:01:21,784 --> 00:01:25,154
the next case.
If we happen to run into a character

17
00:01:25,154 --> 00:01:29,448
that's not in the pattern, then we can
just skip all the way over.

18
00:01:29,448 --> 00:01:34,139
We don't have to compare any of our
pattern character against that one.

19
00:01:34,139 --> 00:01:37,641
So we can just add m if we have m-pattern
characters.

20
00:01:38,413 --> 00:01:43,075
And so then in this case we have a more
complicated situation.

21
00:01:43,075 --> 00:01:49,098
We match the E, and we have a bunch of Es,
And so, part of the algorithm is to figure

22
00:01:49,098 --> 00:01:54,190
out, when you do match a character that's
in the pattern and that what do you do.

23
00:01:54,190 --> 00:01:59,855
But the intuition is, it's fast, because
you're often going to have this kind of

24
00:01:59,855 --> 00:02:05,090
case where you find a mismath.
And since you're going from right to left,

25
00:02:05,090 --> 00:02:10,002
that's you know, that, there's character
in the text that's not in the pattern at

26
00:02:10,002 --> 00:02:12,615
all, you can shift over N characters at a
time.

27
00:02:12,615 --> 00:02:15,740
And that's a huge win in a lot of
practical situation.

28
00:02:16,090 --> 00:02:19,475
So,
So how much do we skip?

29
00:02:19,475 --> 00:02:25,079
So the, the first case is clear.
If there's,

30
00:02:25,430 --> 00:02:32,252
If you run into a text character that's
not in the pattern, then you just move I

31
00:02:32,252 --> 00:02:39,159
from wherever it was to one character
beyond the current one that you're looking

32
00:02:39,159 --> 00:02:42,741
at.
So, and that's an easy, easy calculation,

33
00:02:42,741 --> 00:02:49,734
it's just increment it by the number of
pattern characters that you've not looked

34
00:02:49,734 --> 00:02:53,710
at.
Now, but if you have a character in a

35
00:02:53,710 --> 00:02:59,130
pattern then,
You want to align the text and in this

36
00:02:59,130 --> 00:03:04,713
case, with the rightmost pattern, and, and
that's pretty good.

37
00:03:04,713 --> 00:03:10,030
But you don't always want to do that.
If you for example,

38
00:03:10,268 --> 00:03:15,271
We have a mismatch on E.
If you were to say, oh, let's line up on

39
00:03:15,271 --> 00:03:19,798
the rightmost most E on the pattern, that
would involve backup.

40
00:03:19,798 --> 00:03:25,198
So you don't want to do backup.
In that case, you just move on by one.

41
00:03:25,198 --> 00:03:29,089
So there are times when the heuristic is
no help.

42
00:03:29,089 --> 00:03:34,490
So what you really want to do is just
increment by one in that case.

43
00:03:34,826 --> 00:03:45,587
So just given, those basic preliminaries,
It's, clear that it's, actually easy to go

44
00:03:45,587 --> 00:03:52,089
ahead and precompute, how much, you might
want to skip.

45
00:03:52,425 --> 00:04:03,299
So, you start out with, with -one.
So this is the, amount, that, you're going

46
00:04:03,299 --> 00:04:12,396
to increment the, the text pointer.
N - one means you just it's not in the

47
00:04:12,396 --> 00:04:15,871
pattern, so you're just going to move on
one.

48
00:04:15,872 --> 00:04:20,060
And so all you do is go through the
pattern,

49
00:04:20,306 --> 00:04:26,219
And so let's start with N.
So we want too fill in this table for N.

50
00:04:26,465 --> 00:04:32,953
And it says, so if we, if we are going to
find an N in the text, what do we want to

51
00:04:32,953 --> 00:04:36,895
do.
Well we want to move to the next text

52
00:04:36,895 --> 00:04:43,794
character sorry, we want to co, compare
the next text character against this one,

53
00:04:43,794 --> 00:04:48,557
which is zero.
And so right of path.charAt j = j.

54
00:04:48,557 --> 00:04:51,841
It's going to, just going to fill in that
zero.

55
00:04:51,842 --> 00:05:03,207
Then we go to j = one for the E,
E, D, L. it's same, so for what happens is

56
00:05:03,207 --> 00:05:08,882
for if you do that for every letter in the
pattern, if you have a letter that appears

57
00:05:08,882 --> 00:05:13,869
more often it gets overwritten until the
right most occurrence is the one that's

58
00:05:13,869 --> 00:05:16,563
there.
So, that precomputation is just one path

59
00:05:16,563 --> 00:05:22,156
through the pattern.
And then giving that precomputation and,

60
00:05:22,466 --> 00:05:31,876
again we're going to implement, this is an
implementation of Boyer-Moore, for using

61
00:05:31,876 --> 00:05:39,853
for loop incrementing in the text
characte. And so we, are going to have the

62
00:05:39,853 --> 00:05:43,880
text pointer I.
I in the course of the algorithm, we're

63
00:05:43,880 --> 00:05:49,574
going to compute a value skip which is the
amount that we're going to move the text

64
00:05:49,574 --> 00:05:53,324
character.
So we go all the way through the pattern

65
00:05:53,324 --> 00:05:57,768
from right to left.
If we get all the way through the pattern

66
00:05:57,976 --> 00:06:03,323
and we find a match, and we don't change
the value of skip, then we've found a

67
00:06:03,323 --> 00:06:06,884
match.
If we don't, sorry, if we get all the way

68
00:06:06,884 --> 00:06:10,247
through the pattern and don't find a
mismatch.

69
00:06:10,466 --> 00:06:15,950
It's all matches we don't change the value
of skip and we found a match.

70
00:06:15,950 --> 00:06:22,165
If we do find a mismatch than we're going
to compute the value that we skip, which

71
00:06:22,165 --> 00:06:26,626
is where we are minus that table that
thing that we computed.

72
00:06:26,845 --> 00:06:33,279
And that's the amount that now that we're
going to add to the text character to take

73
00:06:33,279 --> 00:06:41,081
care of the mismatch for the right to left
mismatch. and if it's, it's, always going

74
00:06:41,081 --> 00:06:47,081
to be at least one, but if this thing is
negative, we're not going to back up.

75
00:06:47,081 --> 00:06:51,678
That's, that's it.
So it's a very simple implementation based

76
00:06:51,678 --> 00:06:57,207
on this idea of moving from right to left
and skipping over the whole thing if we

77
00:06:57,207 --> 00:07:00,914
find a mismatch.
And the key thing about Boyer-Moore you

78
00:07:00,914 --> 00:07:06,003
can study that implementation and figure
it out easily, but the key thing is that

79
00:07:06,192 --> 00:07:10,590
this mismatched character here is, takes
time proportional to N over M.

80
00:07:10,590 --> 00:07:14,522
That's kind of amazing.
We had started out with a brute force

81
00:07:14,522 --> 00:07:18,455
which was N  N,
And then we got linear time, we were happy

82
00:07:18,455 --> 00:07:23,549
with N, but actually there's a lot of
practical situations where you can do the

83
00:07:23,549 --> 00:07:27,288
search in N over M.
The longer the pattern gets, the faster

84
00:07:27,288 --> 00:07:31,544
the search gets.
It's not only sublinear, it gets better.

85
00:07:31,775 --> 00:07:37,380
That's kind of amazing.
Now there's a caveat there because there

86
00:07:37,380 --> 00:07:41,910
is a worse case that could be as bad as
the brute force.

87
00:07:42,111 --> 00:07:47,811
So this is just a example of the worst
case where you go from right to left,

88
00:07:47,811 --> 00:07:53,377
let's say, and always get to the first
character before finding the mismatch, and

89
00:07:53,377 --> 00:07:56,796
you can't do anything better than move
over one.

90
00:07:56,797 --> 00:08:02,766
But actually, what you can do is build
something kind of like Knuth-Morris-Pratt

91
00:08:02,967 --> 00:08:09,338
to make sure that you don't do something
like this with a, repetitive pattern.

92
00:08:09,338 --> 00:08:13,026
And you, you can get the worst case to the
linear,

93
00:08:13,227 --> 00:08:19,496
But it's really of importance for, for
intermediate length patterns with, because

94
00:08:19,496 --> 00:08:24,325
you so often are going to be able to get
this N over M performance.

95
00:08:24,532 --> 00:08:28,740
And that's a widely used algorithm.
The Boyer-Moore algorithm.