So it appears that mutual information tells us which features - words, in our earlier example - are good predictors of the behavior we want to predict. Shouldn't we simply use those with the highest mutual information as our features?

The trouble is that the actual mutual information from the formula is very difficult to compute exhaustively; there are just too many possibilities when the number of features is large. So in practice we use proxies. A good proxy that we've seen earlier is the inverse document frequency. There are other techniques - AdaBoost, in particular, is an important algorithm - but we won't go into those in detail in this course. For the moment, think of using words with high inverse document frequency as a proxy for words which are likely to be good features.

Another question we might ask is: are more features always good? The y-axis here is measuring the error in the classification, that is, how often naive Bayes gets the wrong answer. This error improves as we add features, but after some time it starts to degrade again. Why might this be happening? What do you think? Perhaps we are using the wrong features to start with. It turns out that that's not the whole story either. In this example the features having the lowest mutual information - or information gain, which is another term for the same idea - are the ones used first, and the good features come later; still the classifier goes awry.

So what's going on? Can you guess? Remember that there is a reason why naive Bayes is called naive. It doesn't like redundant features: it assumes that features are independent. It likes features to have very small mutual information amongst themselves. The trouble is that that's not always the case, and this is one reason why the technique can fail. It gets confused because it assumes that features are independent.

In principle, one should be able to compute the best features, either by computing the mutual information directly or by using a proxy, while also somehow figuring out which features are dependent, and then choose those best features which are also mutually independent. Many, many machine learning techniques do exactly this. We don't have time to go into those techniques in detail, but the idea should be clear by now.
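To make the idea of ranking features by mutual information concrete, here is a minimal sketch in Python. The tiny corpus, the labels, and the function names are invented for illustration; they are not from the lecture. It estimates the mutual information between each word's presence and the class label from simple counts, which is the kind of computation that quickly becomes expensive if done exhaustively over combinations of many features.

```python
# A minimal sketch of mutual-information-based feature scoring for words.
# The toy corpus below is made up for illustration; in practice you would
# compute this over a real labelled collection of comments.
import math
from collections import Counter

corpus = [
    ("great product works well", 1),     # 1 = positive
    ("terrible product broke fast", 0),  # 0 = negative
    ("works great love it", 1),
    ("broke after a day terrible", 0),
]

def mutual_information(word, docs):
    """Estimate I(word present; class) from simple counts over labelled docs."""
    n = len(docs)
    joint = Counter()                    # (word present?, class) -> count
    for text, label in docs:
        present = int(word in text.split())
        joint[(present, label)] += 1
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_y = sum(v for (_, yy), v in joint.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Rank every word in the vocabulary by its mutual information with the label.
vocabulary = {w for text, _ in corpus for w in text.split()}
scores = sorted(((mutual_information(w, corpus), w) for w in vocabulary), reverse=True)
for score, word in scores[:5]:
    print(f"{word:10s} {score:.3f}")
```

Words that appear only in one class, such as 'great' or 'broke' in this toy data, come out with the highest scores, which is what makes them attractive as features.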
Let's return now to looking at machine learning from the perspective of information theory. We have a machine learning algorithm which takes a sequence of observations, such as comments, and classifies them as positive or negative, in a manner that maximizes the mutual information between the actual classifications of the observations and the ones that the algorithm manages to predict.

Relate this to what Shannon defined as the capacity of a communication channel. In his case this was an actual communication channel, if you remember, like a telephone channel or a radio channel. He was worried about how fast you could transmit information on such a channel. So he defined the capacity of a channel as the maximum information that could be transferred between the sender and the receiver per second. The element of speed comes in when you talk about capacity.

What does this mean in the context of machine learning? Is there an equivalent notion of capacity? How fast can a machine learning algorithm actually learn? And what does it mean to be fast? It turns out that there has been a lot of work on the theory of machine learning. The pioneer here is Leslie Valiant, who won the Turing Award in 2011. Other important papers defined something called the VC dimension, using which it was shown in this paper that a Bayesian classifier will eventually learn any concept - any distinction between plus and minus, yin and yang. The trouble is that it need not do so fast. What does that mean? How many training examples does a classifier require to learn a concept? That is the equivalent of speed in the world of machine learning. And how fast depends on the concept itself: the VC dimension, or Vapnik-Chervonenkis dimension, of a concept can be measured, and using it this paper showed that Bayesian learning can eventually learn any concept, with the speed depending on the VC dimension of the concept.
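To make that notion of speed a bit more concrete, here is one standard form of the sample-complexity bound from PAC learning theory. This is an illustrative statement of the general result, not necessarily the exact form in the paper shown on the slide; constants and logarithm bases vary across presentations.

```latex
% Number of training examples m sufficient to learn a concept class of
% VC dimension d to within error \epsilon, with probability at least 1 - \delta:
m \;=\; O\!\left( \frac{1}{\epsilon} \left( d \,\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta} \right) \right)
```

Read this way, a concept with larger VC dimension d needs more training examples to reach the same accuracy, which is exactly the sense in which how fast a classifier can learn depends on the concept itself.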
Well, that's all we're going to do regarding machine learning theory for the moment. Let's return now to the question of whether sentiment analysis is actually measuring an opinion about a product, a course, or anything else. Remember, there are hundreds of millions of tweets a day; we can listen to the voice of the consumer like never before, and we can figure out the sentiments, just as we've discussed in our example.

But how do we figure out what consumers are saying or complaining about, not just whether or not they are complaining? What is the object of their complaint, or for that matter of their request or demand?

Consider a comment such as, 'Book me an American flight to New York.' What does the word American mean? Does it mean the airline, American Airlines? Or does it mean the nationality of the airline, so that any airline of American origin will do? Obviously this is an ambiguous sentence, and language is full of such vagueness and ambiguities. Suppose the writer also said, 'I hate British food.' Maybe the guess is now that it's probably American Airlines, because British Airways is another airline and perhaps they're talking about the food on British Airways. But suppose the comment was, 'I hate English food.' Suddenly you change your decision, and now you think he means any American carrier, not just American Airlines, because American versus English clearly suggests he is talking about nationality, whereas American versus British suggests he is more likely talking about the carriers themselves.

Consider the sentence 'I only eat Kellogg's cereals' versus 'Only I eat Kellogg's cereals.' Two very different things. What can you say about this household's breakfast stockpile? Clearly, in the first case it's possibly saying that that person really wants to eat only Kellogg's. In the second case he's saying that maybe he wants to eat Kellogg's, but the rest of his family just doesn't like it. Two very different meanings.

'Took the new car on a terribly bumpy road. It did well though.' Is this family happy with their new car? Just looking at sentiment, it has negative words - terribly, bumpy. It does have the positive word well, but would the Bayesian classifier guess correctly whether this is a positive or a negative comment? Probably not.

The point we're trying to get at is this: is Bayesian learning using a bag of words - the features being just the words themselves - enough? And more deeply, we're trying to ask the question raised by Richard Montague and Noam Chomsky: how do we actually discern the meaning of a sentence, versus just classifying it as positive or negative, good or bad, yin or yang?
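To see concretely how a pure bag-of-words view can stumble on the car example, here is a minimal sketch. The polarity word lists and the scoring rule are invented for illustration; a real system would learn word weights from labelled data, as naive Bayes does, but it would face the same problem of ignoring word order and structure.

```python
# A toy bag-of-words sentiment scorer. The word lists below are assumptions
# made for this sketch, not part of the lecture or any real lexicon.
POSITIVE = {"well", "great", "good", "love", "happy"}
NEGATIVE = {"terribly", "bumpy", "bad", "hate", "broke"}

def bag_of_words_sentiment(text):
    """Count polarity words, ignoring order and the concessive 'though'."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"

comment = "Took the new car on a terribly bumpy road. It did well though."
print(bag_of_words_sentiment(comment))  # prints 'negative': the two negative
                                        # words outweigh the single 'well',
                                        # even though the writer is pleased.
```

The representation throws away exactly the structure - the 'It did well though' - that signals the overall opinion is positive, which is the gap between merely classifying a sentence and actually discerning its meaning.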