So, if mutual information tells us which features (words, in our earlier example) are good predictors of the behavior we want to predict, should we not simply use those with the highest mutual information as our features? The trouble is that the actual mutual information from the formula is very difficult to compute exhaustively: with a large number of features there are simply too many possibilities. So in practice we use proxies. A good proxy that we've seen earlier is inverse document frequency. There are other techniques as well; AdaBoost, in particular, is an important algorithm, but we won't go into it in detail in this course. For the moment, think of words with high inverse document frequency as a proxy for words that are likely to be good features. Another question we might ask is: are more features always good? The Y axis here measures the classification error, that is, how often naive Bayes gets the wrong answer. This error improves as we add features, but after some point it starts to degrade again. Why might this be happening? What do you think? Perhaps we are using the wrong features to start with. It turns out that that's not the whole story either. In this example, the features with the lowest mutual information (or information gain, which is another term for the same idea) are used first, and the good features come later; still, the classifier goes awry. So what's going on? Can you guess? Remember that there is a reason why naive Bayes is called naive: it assumes that features are independent. It doesn't like redundant features; it likes features that have very small mutual information amongst themselves. The trouble is that's not always the case, and this is one reason why the technique can fail: it gets confused because it assumes that features are independent.
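To make the idea concrete, here is a minimal sketch (the function name and the toy labeled comments are invented for illustration) that computes the mutual information between the presence or absence of each word and the class label, then ranks the vocabulary by it, the way a feature selector would:

```python
import math

# Toy labeled comments (invented for illustration).
docs = [
    ("great course loved it", "+"),
    ("great lectures great fun", "+"),
    ("loved the examples", "+"),
    ("terrible audio hated it", "-"),
    ("hated the pace terrible", "-"),
    ("boring and terrible", "-"),
]

def mutual_information(word, docs):
    """I(W; C), where W = word present/absent and C = class label."""
    n = len(docs)
    mi = 0.0
    for w in (True, False):
        for c in ("+", "-"):
            joint = sum(1 for text, label in docs
                        if (word in text.split()) == w and label == c) / n
            p_w = sum(1 for text, _ in docs if (word in text.split()) == w) / n
            p_c = sum(1 for _, label in docs if label == c) / n
            if joint > 0:  # 0 * log 0 is taken as 0
                mi += joint * math.log2(joint / (p_w * p_c))
    return mi

vocab = {word for text, _ in docs for word in text.split()}
ranked = sorted(vocab, key=lambda w: mutual_information(w, docs), reverse=True)
print(ranked[0])  # the single most informative word
```

On this toy data, "terrible" (which appears in every negative comment and no positive one) tops the ranking, while words like "the" that appear in both classes score near zero. The exhaustive version of this computation over all subsets of features is what becomes intractable, which is why proxies such as inverse document frequency are used instead.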
Well, in principle, one should be able to compute the best features, either by computing the mutual information directly or by using a proxy, as well as somehow figuring out which features are dependent, and then choose those best features which are also independent of each other. Many, many machine learning techniques do exactly this. We don't have time to go into the techniques in detail, but the idea should be clear by now. Let's return now to looking at machine learning from the perspective of information theory. We have a machine learning algorithm which takes a sequence of observations, such as comments, and classifies them as positive or negative, in a manner so as to maximize the mutual information between the actual classifications of the observations and the ones that the algorithm manages to predict. Let's relate this to what Shannon defined as the capacity of a communication channel. In his case, this was an actual communication channel, if you remember, like a telephone channel or a radio channel. He was worried about how fast one could transmit information on such a channel, so he defined the capacity of a channel as the maximum information that could be transferred between sender and receiver per second. The element of speed comes in when you talk about capacity. What does this mean in the context of machine learning? Is there an equivalent notion of capacity? How fast can a machine learning algorithm actually learn? And what does it mean to be fast? It turns out that there has been a lot of work on the theory of machine learning. The pioneer here is Leslie Valiant, who won the 2010 Turing Award. Other important papers defined something called the VC dimension, using which it was shown that the Bayesian classifier will eventually learn any concept, that is, any distinction between plus and minus, yin and yang. The trouble is it need not learn fast. What does that mean? How many training examples does a classifier require to learn a concept?
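In symbols, Shannon's definition of capacity is the mutual information between the channel input X and the channel output Y, maximized over all distributions of the input (per channel use; multiplying by the number of uses per second gives a rate in bits per second):

```latex
C \;=\; \max_{p(x)} \, I(X; Y)
```

The parallel with the machine learning setting above is that learning also tries to maximize a mutual information, between the true classifications and the predicted ones.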
That's the equivalent of speed in the world of machine learning. And how fast depends on the concept itself: the VC dimension, or Vapnik-Chervonenkis dimension, of a concept can be measured, and using it, this paper showed that Bayesian learning can eventually learn any concept, with a speed that depends on the VC dimension of the concept. Well, that's all we're going to do regarding machine learning theory for the moment. Let's return now to the question of whether sentiment analysis is actually measuring an opinion about a product, a course, or anything else. Remember, there are hundreds of millions of tweets a day; we can listen to the voice of the consumer like never before and figure out their sentiments, just as we've discussed in our example. But how do we figure out what consumers are saying or complaining about, not just whether or not they are complaining? What is the object of their complaint, or for that matter their request or demand? Consider a comment such as, 'Book me an American flight to New York.' What does the word American mean? Does it mean the airline, American Airlines? Or does it mean the nationality of the airline, so that any airline of American origin will do? Obviously, this is an ambiguous sentence, and language is full of such vagueness and ambiguity. Suppose the writer also said, 'I hate British food.' Now the guess is probably American Airlines, because British Airways is another airline, and maybe they're talking about the food on British Airways. But suppose the comment was, 'I hate English food.' Suddenly you change your decision: now the writer is thinking of any American carrier, not just American Airlines, because 'American versus English' clearly signals that he is talking about nationality, whereas 'American versus British' makes it more likely that he is talking about the carriers themselves.
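To give a flavor of how the VC dimension governs speed, here is a standard sample-complexity bound from PAC learning theory (stated from memory as the typical textbook form, not quoted from the paper discussed above): to learn a concept from a class of VC dimension d to within error ε, with probability at least 1 - δ, it suffices to see on the order of

```latex
m \;=\; O\!\left(\frac{1}{\varepsilon}\left(d \,\log\frac{1}{\varepsilon} \;+\; \log\frac{1}{\delta}\right)\right)
```

training examples. The larger the VC dimension d, the more examples are needed, which is exactly the sense in which a concept with high VC dimension is slower to learn.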
Consider this sentence, 'I only eat Kellogg's cereals,' versus 'Only I eat Kellogg's cereals.' Two very different things. What can you say about this household's breakfast stockpile? Clearly, in the first case the person is saying that he really wants to eat only Kellogg's. In the second case, he's saying that maybe he wants to eat Kellogg's, but the rest of his family just doesn't like it. Two very different meanings. Or take, 'Took the new car on a terribly bumpy road. It did well though.' Is this family happy with their new car? Just looking at sentiment, the comment has two negative words, terribly and bumpy, and only one positive word, well. Would the Bayesian classifier correctly guess that this is a positive comment? Probably not. The point we're trying to get at is this: is Bayesian learning using a bag of words, with the features just being the words themselves, enough? And more deeply, we're asking the question of Richard Montague and Noam Chomsky: how do we actually discern the meaning of a sentence, versus merely classifying it as positive or negative, good or bad, yin or
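To see why the car comment above defeats a bag-of-words approach, here is a minimal sketch (the tiny sentiment lexicon and the scoring rule are invented for illustration; a trained naive Bayes would weight words by learned probabilities rather than counts, but the failure mode is the same) that scores a comment purely by counting positive and negative words, ignoring all sentence structure:

```python
# Toy sentiment lexicon (invented for illustration).
POSITIVE = {"well", "good", "great", "happy"}
NEGATIVE = {"terribly", "bumpy", "bad", "hate"}

def bag_of_words_sentiment(text):
    """Score = (# positive words) - (# negative words); word order is ignored."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"

comment = "Took the new car on a terribly bumpy road. It did well though."
print(bag_of_words_sentiment(comment))  # two negative words outvote one positive
```

The comment is really praise for the car, but because 'terribly' and 'bumpy' outnumber 'well', any classifier that sees only a bag of words will call it negative. Recovering the intended meaning requires structure, which is exactly the gap this lecture is pointing at.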