Let's now revisit machine learning, which we studied during the Listen chapter; the naive Bayes classifier is one method for machine learning. Let's try to do this a little more formally, in a manner that lets some of the material that comes later fall naturally into the same formal framework.

We have a bunch of data, in this case features X1 to XN, which we call X: capital X is a vector, a collection of features. Earlier we had different examples: the query terms that we used, the words in a comment, and in general anything that we observe about instances which we then want to classify into different classes. Then we have some output variables, which are the labels that we would like to assign to a class, whether a positive or negative sentiment, or a buyer or a browser. In this case we'll say the output variable is a zero if it's a buyer and a one if it's a browser, or a zero if it's negative and a one if it's positive, and so on. In general we might have many different output variables, each of them zero or one.

The points in the space X could be shown as points in some high-dimensional space. I've shown them here as just points in the plane, but in general each of these will have many coordinates, N for example. Some of them are red, which means they have a zero assigned to them, and some of them are blue, which means they have a one assigned to them, in terms of the Y variable. The goal of classification is to figure out, given a new point, whether it's likely to be red or blue. In general this need not be a binary task, and you may want to classify into three, four, or many classes, but for the moment we stick to binary classification, as it makes things a bit simpler.

Now, to formally model the classification problem, given a bunch of features in a space, we define a function, let's say f(X), which is the expected value of the output variable Y given the input variables X.
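In symbols, this definition (a notational restatement of what was just said, nothing more) is:

```latex
f(x) \;=\; \mathbb{E}\left[\, Y \mid X = x \,\right]
      \;=\; \sum_{y \in \{0,1\}} y \, P(Y = y \mid X = x)
```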
Now, this is a little different from what we had earlier, which was the probability of getting some variable Y equal to zero or one given a particular combination of X's. But as we shall soon see, it is really the same thing, at least for classification; for other types of machine learning we'll have different forms of f, and that's why this framework is useful: it unifies our whole concept of machine learning.

So what we're trying to do is figure out the expected value of Y. Is it a zero or a one? Is it closer to zero or closer to one, given a particular combination x? Well, since Y is zero-one, the expected value is nothing but one times the probability of Y equal to one given X, plus zero times the probability of Y equal to zero given X. Since the second term is multiplied by zero, it disappears, and we simply get the probability that Y equals one given X. And this is exactly what we had earlier estimated using various assumptions, like independence and the Bayes rule. We figured out that all we needed was a training set from which we could compute the required likelihoods, which would enable us to compute the probability of Y equal to one given a combination x.

Now, that's fairly basic, but this formulation tells us a few things. First, it's not always necessary to compute this probability of Y equal to one given x directly; we did that through some approximation in naive Bayes. We might instead want to compute this function f(X). In this case it turns out to be just the same as this probability; in other cases it might not be. For example, if you had a classifier with more than two classes, you might have a different form of f. The second thing it tells you is that the probability is only one way of figuring out where a particular combination X belongs. Another way might be to somehow estimate this f and estimate the boundary where f(X) is more than or less than a half. For example, in this case, if you could figure out that the boundary lay somewhere along this line over here, so that we minimize the number of false positives and false negatives, we could find a much easier way to classify the points as being positive or negative than actually computing this rather complex animal, which is the a posteriori probability.
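To make this concrete, here is a minimal sketch of the first route: estimating P(Y=1|X) from a training set under the naive Bayes independence assumption, and then classifying by comparing that estimate with one half. This is not the lecture's worked example; the tiny training matrix, the Laplace smoothing, and all names are made up for illustration.

```python
import numpy as np

# Toy training data: each row is a binary feature vector x (e.g., presence or
# absence of particular query words), and y is the class label
# (e.g., 1 = buyer, 0 = browser).  Purely illustrative numbers.
X_train = np.array([[1, 0, 1, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 1, 0, 1],
                    [1, 1, 1, 0]])
y_train = np.array([1, 1, 0, 0, 1])

def naive_bayes_posterior(x, X_train, y_train, alpha=1.0):
    """Estimate P(Y=1 | X=x) using the independence assumption and
    add-one (Laplace) smoothing on the per-feature likelihoods."""
    joint = []
    for c in (0, 1):
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        # Smoothed estimate of P(x_j = 1 | Y = c) for each feature j.
        p1 = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        likelihood = np.prod(np.where(x == 1, p1, 1 - p1))
        joint.append(prior * likelihood)
    # Bayes rule: normalise by the evidence P(X = x).
    return joint[1] / (joint[0] + joint[1])

x_new = np.array([1, 0, 1, 0])
f_hat = naive_bayes_posterior(x_new, X_train, y_train)  # estimate of f(x) = P(Y=1|x)
label = 1 if f_hat >= 0.5 else 0                        # the "more or less than a half" test
print(f_hat, label)
```

The final comparison with 0.5 is exactly the boundary test on f(X) described above.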
Directly trying to find estimates of f is what more sophisticated machine learning techniques, like support vector machines, end up doing. They also do other complicated things, like changing the nature of the space: instead of dealing with the space as it is given, they use various combinations of the x's, sometimes squaring them, sometimes doing other funny things to them, to project the space into a different space, and then find a line or a plane which nicely separates the positive and negative instances (a tiny numerical sketch of this squaring idea appears a little further below). So what they're really doing is estimating the boundary of f directly, without actually computing the function itself. Now, that's another long-winded way of defining machine learning, but it will serve to unify the different types of learning that we'll see as we go along this week.

Let's take a look at some examples using the new notation that we just introduced. Recall our machine-learning task of deciding whether somebody is a buyer or a browser, based on the queries that they issue in a search box. In this case the queries that we had last time were red, flower, gift, or cheap, and we had various instances of people querying and then possibly buying or not buying. So our X was essentially the set of query words, and Y was whether or not one buys; our (Y, X) space is B, R, F, G and C in the notation we had last time.

Similarly, we had machine learning of positive or negative comments. In this case the sentiment was the Y, and the set of all words formed our X, which is an extremely high-dimensional space. A word can either be present or absent in a comment, so all of these are still binary variables.

We turn to another example now. Imagine a baby observing various animals and wanting to figure out which animal has which name. We have various features: the size of the animal, the size of its head, the noise the animal makes, the number of legs it has, and the animal itself. The baby observes many instances and somehow is able to discern these features, in terms of whether the size of the head is large, small, or medium, whether the animal itself is large or small, the noise that the animal makes, etc. The machine learning task is then to classify a new animal into the appropriate category: lion, cat, elephant, etc.
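Here is the promised sketch of the "project the space, then separate with a line or a plane" idea. It is a hand-crafted illustration, not how a support vector machine is actually trained: the data are synthetic, and the boundary is chosen by inspection rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D points: label 1 if the point lies inside the unit circle, 0 otherwise.
# No straight line in the original (x1, x2) plane separates the two classes.
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

# Project each point into a new space using squared coordinates.
Z = X**2                      # z1 = x1^2, z2 = x2^2

# In the projected space the classes ARE linearly separable:
# the straight line z1 + z2 = 1 does it.
y_hat = (Z[:, 0] + Z[:, 1] < 1.0).astype(int)
print((y_hat == y).mean())    # 1.0 -- the linear boundary in z-space is exact
```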
Returning to the animal example: in (Y, X), the animal itself is Y, and the size of the animal, its head size, its noise, and its number of legs are the features; this is a fixed set. Here all four features are multi-valued categorical variables, so they take values in specific categories. They are not real numbers, for example; each takes one of a few categories, such as small, medium, or large.

And lastly we consider another example, where we have customers going to supermarkets and buying bunches of products together in transactions. A transaction might consist of milk, diapers, and some cola; another one might consist of some diapers and beer; yet more transactions will have different products. Here, interestingly, we have no output variable, only features, which are just the items that people buy. The items are again multi-valued categorical variables, but they form a variable-sized set, just like the comments that we had earlier: we could either consider the set of all possible items and have binary variables, or we could have a variable number of features that are multi-valued categorical variables.

These examples are all cases where one could do various kinds of machine learning. In the first three examples, that is queries, comments, and animals, there was a clear output variable, which indicated the class of the particular instance. So one could imagine a supervised machine learning scenario, using something like a naive Bayes classifier, to compute the likelihoods and estimate the a posteriori probability or, as we have just seen, the expected value of the class given X. In the last example, transactions, there was no output variable, so the task there is a little bit different, which we shall come to very soon.

Do go back over the formalism, and make sure that you're able to figure out how classification happens in each of these cases, that is, exactly how the Y and the X are formulated, so that we get a formal representation of the problem in terms of features and an output variable. In this particular case, of course, there is no output variable, as we have already mentioned; let's now get into the reason for that.
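Purely as an aid to that exercise, here is one possible way the (Y, X) formulations for the animal and transaction examples might be written down. The feature names and values are made up for illustration, not data from the lecture.

```python
# Animal example: a fixed set of multi-valued categorical features X,
# plus an output variable Y (the animal's name) -- a supervised setting.
animal_instance = {
    "X": {"size": "large", "head_size": "medium", "noise": "roar", "legs": 4},
    "Y": "lion",
}

# Transaction example: a variable-sized set of categorical items and
# no output variable at all -- so there is no class to predict directly.
transaction = {"X": {"milk", "diapers", "cola"}, "Y": None}

# Alternatively, fix the universe of items and use binary presence/absence
# features, just as we did with words in comments.
ITEM_UNIVERSE = ["milk", "diapers", "cola", "beer"]
x_binary = [1 if item in transaction["X"] else 0 for item in ITEM_UNIVERSE]
print(x_binary)   # [1, 1, 1, 0]
```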