The most simple and fundamental technique of machine learning is called the Bayesian classifier. It is also called a Naive Bayesian classifier, because it makes the assumption that all the features one uses to learn from historical data are independent for a particular choice of the behavior one is trying to learn. So, for example, if one is trying to predict whether a given user is shopping or just surfing, using the query words that they use to search, we assume that the words themselves are all independent and equally likely to occur regardless of what other words are used in the query. This is not necessarily a valid assumption in most cases, as we argued a while back, but that is what is assumed, and that is why it is called a Naive Bayes classifier.

Anyway, let's try to figure this out from first principles. The probability of B, that is, whether some user is a buyer or just a browser, given the combination of R and C that they use, times the probability of R and C itself, is, by Bayes' rule, nothing but the probability of R and C given B, times the probability of B. This is easy to see just by thinking of R, C as one random variable and applying Bayes' rule.

Next, we further expand the first term over here, again using Bayes' rule and forgetting about B for the time being, noting that the probability of R and C given B is the probability of R given C and B (the B is there just because it was there earlier), times the probability of C given B (which again is there because it was there above). So, by applying Bayes' rule once more, we get this expansion.

Now comes the independence assumption. Because R and C are independent, the probability of R given C and B (or C and anything else; it doesn't matter) is simply the probability of R given whatever else was on the conditioning side.

Because of this independence we can essentially compute these terms separately: the probability of R given B equal to yes, and the probability of C given B equal to yes. We then simply multiply them by the a priori, or overall, probability of B equal to yes. We can do the same for B equal to no.

So, given some values of R and C, that is, whether or not the word "red" occurs and whether or not the word "cheap" occurs, we compute the probability of R being there given that B is yes, and the probability of C being there (or not there) given B equal to yes.
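To keep track of the steps just described, here they are written out explicitly, with R, C, and B denoting the "red", "cheap", and buyer-or-browser variables from the lecture (the equations are reconstructed from the narration rather than quoted from a slide):

```latex
\begin{align*}
P(B \mid R, C)\, P(R, C) &= P(R, C \mid B)\, P(B)
  && \text{Bayes' rule, treating $(R, C)$ as one random variable} \\
P(R, C \mid B) &= P(R \mid C, B)\, P(C \mid B)
  && \text{expanding the first term, again by the same rule} \\
P(R \mid C, B) &= P(R \mid B)
  && \text{the ``naive'' independence assumption} \\
\Rightarrow\; P(B{=}\mathrm{yes} \mid R, C) &\propto
  P(R \mid B{=}\mathrm{yes})\, P(C \mid B{=}\mathrm{yes})\, P(B{=}\mathrm{yes})
\end{align*}
```

and the analogous expression with B = no throughout.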
We multiply these by the a priori probability of B equal to yes, and divide by the same expression for the other possibility, that B equals no. Now, since this expression is proportional to the conditional probability that we are looking for, that is, whether the user is more likely to be a buyer or a browser given the query words that they used, we simply compute the expression for both possibilities, buyer and browser, and take their ratio. If the ratio is greater than one, or in general greater than some threshold, we choose B equal to yes as the most likely estimate. Otherwise, we decide that B equals no.

So, congratulations: we have just derived the most fundamental classification mechanism in machine learning, the Naive Bayes classifier, from first principles. And, as you can see, the derivation is simple arithmetic and basic probability.

Given any combination of words and the past history, say the four features that we had earlier, "red", "flower", "gift", and "cheap", and in general N features for arbitrarily long queries, taking particular values yes, no, yes, no, et cetera, we can merely compute the ratio that we had earlier, except that now we have N terms in the product as opposed to just two, plus, at the end, the a priori probabilities of the two different types of behavior that we are trying to predict.

This ratio is called a likelihood ratio, because each of the terms in the product is a likelihood, in the sense that if x1 = yes, then the probability here is the probability that x1 = yes for the situation when people are buying, versus the probability that x1 = yes when people are just browsing. And we do that for all the possible words. Many of them will not occur, so the xi's for them will be no, but we still have to take all the possible words into account. We choose yes if this likelihood ratio is greater than a threshold, and no otherwise.

We also normally take logarithms to turn multiplications into additions, so you will frequently hear the term log likelihood. The reason is that with these many, many probabilities, all less than one, being multiplied together, you will probably run out of decimal places and end up with all zeros. By taking logarithms you reduce the chance of such underflow occurring and make the computation stable.

In particular, note that you really need to consider all possible features, as the sketch below illustrates.
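Here is a minimal sketch of that computation in Python. It is not the lecture's own code: the feature names come from the lecture's example, but the conditional probabilities and priors are made-up placeholder values that would, in practice, be estimated from historical data.

```python
import math

# P(feature = yes | class) for each query-word feature (placeholder values).
p_yes_given_buyer   = {"red": 0.40, "flower": 0.30, "gift": 0.35, "cheap": 0.50}
p_yes_given_browser = {"red": 0.20, "flower": 0.25, "gift": 0.10, "cheap": 0.15}

# A priori (overall) probabilities of the two behaviors (also placeholders).
prior_buyer, prior_browser = 0.3, 0.7

def log_likelihood_ratio(x):
    """x maps each feature name to True ("yes") or False ("no").

    Sums log P(x_i | buyer) - log P(x_i | browser) over every feature,
    plus the log of the ratio of priors; working in logs turns the big
    product into a sum and avoids numerical underflow."""
    llr = math.log(prior_buyer) - math.log(prior_browser)
    for feature, present in x.items():
        p_b = p_yes_given_buyer[feature] if present else 1 - p_yes_given_buyer[feature]
        p_r = p_yes_given_browser[feature] if present else 1 - p_yes_given_browser[feature]
        llr += math.log(p_b) - math.log(p_r)
    return llr

# Example: the query contains "red" and "cheap" but not "flower" or "gift".
x = {"red": True, "flower": False, "gift": False, "cheap": True}
threshold = 0.0  # log of a likelihood-ratio threshold of 1
print("buyer" if log_likelihood_ratio(x) > threshold else "browser")
```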
That includes even the features that don't occur in the query. So, if you have a million words and your query consists of four words, you need to multiply probabilities for a million words. For the four words that you have, you will use the probability that the word occurs for buyers and the probability that the word occurs for browsers. For all the rest, you will have to include the probabilities that the word does not occur, for buyers in the numerator and for browsers in the denominator.

Now this makes it an extremely expensive method, but nevertheless quite useful. We will figure out how to deal with the expense using parallel computing next week. But for the moment, let's take a look at an example of trying to compute sentiment using machine learning and the Naive Bayes classifier.
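To make the point about absent words concrete before moving on, here is a small continuation of the assumed sketch above, with a six-word toy vocabulary standing in for the lecture's million words (the extra vocabulary words are made up for illustration):

```python
# Every vocabulary word is a feature, so even a two-word query produces one
# yes/no value per word, and each class's product gets one factor per word.
vocabulary = ["red", "flower", "gift", "cheap", "blue", "expensive"]
query = {"red", "cheap"}

# True ("yes") for words in the query, False ("no") for everything else.
feature_vector = {word: (word in query) for word in vocabulary}

present = [w for w, v in feature_vector.items() if v]
absent  = [w for w, v in feature_vector.items() if not v]
print(present)  # each contributes P(word present | class) to the product
print(absent)   # each contributes P(word absent | class) = 1 - P(word present | class)
```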