So, in order to compute these conditional probabilities when there are a large number of possible words or features, we need to do a little more work.

Let's look at the simple case of just one word, say "red". There are two possibilities: "red" could be present or absent. So, for R of our cases "red" is present. Out of these, I cases are also B, a "buy" situation, and the rest are not. The total number of cases that are "buy" situations is, as before, K.

So let's see what the simple conditional probabilities are in this case. The probability of a B given that there is an R is just I over R. On the other hand, the probability of an R overall is just R over the total number N. And the probability that both R occurs and there is a B is I over N. Now, this is the joint probability, so you have to divide by the total number of instances N rather than by either of the conditioning counts K or R.

Bayes' rule is actually just simple arithmetic. In particular, if you write I over N as I over R times R over N, let's see what you get. Well, we already know what I over R is: it's just the conditional probability of a B given R. And R over N is the probability of R itself. So simple arithmetic tells us that the joint probability of R and B is the product of the conditional probability and the a priori probability. This is just Bayes' rule: the probability of B and R is the conditional probability of B given R times P of R, which is also the same thing as the probability of R given B times the probability of B. We can see that by rewriting I over N, not by introducing an R but by introducing a K, in which case you get I over K, which is just the probability of R out of all those that are B, times K over N, which is just the probability of B.

Bayes' rule, which some of you may or may not remember, turns out to be just simple arithmetic. As we shall soon see, Bayes' rule is critical to machine learning because it allows us to compute any of those many, many joint probabilities even if there is no data for a particular combination.
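To make the counting argument concrete, here is a minimal Python sketch; the counts N, R, K, and I below are made-up illustrative values, not numbers from the lecture:

    # Hypothetical counts: N queries in total, R contain "red",
    # K are "buy" situations, and I are both "red" and "buy".
    N = 100
    R = 40
    K = 25
    I = 10

    p_B_given_R = I / R   # P(B | R)
    p_R_given_B = I / K   # P(R | B)
    p_R = R / N           # P(R)
    p_B = K / N           # P(B)
    p_joint = I / N       # P(B and R)

    # Bayes' rule as simple arithmetic: both factorizations of I/N agree.
    assert abs(p_joint - p_B_given_R * p_R) < 1e-12
    assert abs(p_joint - p_R_given_B * p_B) < 1e-12
    print(p_joint)  # 0.1 either way

The point of the assertions is exactly the lecture's claim: I/N can be split as (I/R)(R/N) or as (I/K)(K/N), and both products give the same joint probability.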
Before we can do machine learning using Bayesian techniques, we need one more important concept, and that is independence. Think about two words, like "red" and "cheap". As before, we have red equal to yes for R queries and cheap equal to yes for C cases, and in I cases both keywords are present. The probability of an R occurring is clearly R over N, and the probability of "cheap" occurring is C over N, as before. Similarly, the conditional probability of an R given that C occurs can be computed. Independence says that the probability of R does not depend on whether or not the word "cheap" is already present in the query. In other words, the probability of R should be the same as the probability of R given C. Similarly, the probability that "cheap" occurs should not depend on whether or not "red" occurs. In such situations, these two features are independent. Of course, that might not necessarily always be the case. For example, somebody searching for "big data" might actually also search for "map reduce" at the same time, rather than for something unrelated, like "red" or "flower".
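Here is a similar minimal sketch of the independence check, again with made-up counts, chosen here so that the two words come out independent:

    # Hypothetical counts: N queries, R contain "red", C contain "cheap",
    # and I contain both words.
    N = 100
    R = 40
    C = 30
    I = 12

    p_R = R / N            # P(red)
    p_R_given_C = I / C    # P(red | cheap)

    # Independence: P(red | cheap) equals P(red), or equivalently
    # P(red and cheap) equals P(red) * P(cheap).
    print(p_R, p_R_given_C)        # 0.4 and 0.4 -> independent
    print(I / N, p_R * (C / N))    # 0.12 and 0.12

With counts from real queries, the two printed values would generally differ, which is exactly the "big data" and "map reduce" situation: the words are not independent.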