On the web, one often encounters the long-tail phenomenon, which has to do with the fact that the data is very high-dimensional.
For example, think of the recommender systems one encounters while browsing books on Amazon.
The system tells us which books are
frequently bought together.
It also tells us that the customers who
bought the book that we are browsing also
bought some other books.
Now, the important thing to note is that on the web, which has so many different books, no particular set of books has high support.
In fact, in the limit, the support of any combination of books is close to zero. There are just so many books that no particular combination will have high support.
So, association rule mining simply can't
tell us which books are bought together.
Further, consider suggestions like "customers who bought this item also bought these other items." Customers need to be compared, and people have varied interests. So how do we choose amongst the books that the customers who bought this book have also bought? We need some way of choosing from the many choices that each of those customers has made. So what's actually going on here?
It's not traditional rule mining, and it's not traditional clustering: one is not really looking at books that are similar to each other in terms of their content or the words they contain. Yet one is looking at books which are similar for some other reason. They all happen to deal with the same subject merely because the people who bought them were similar and they bought similar books.
At the same time, we've filtered out the
fact that the people have varied
interests.
So somebody interested in cloud computing might also be interested in big data, and might also be interested in poetry.
But we don't see poetry books over here.
So what's going on?
Well, as we mentioned a while back, some of these data sets are actually bipartite, in the sense that we have transactions where people buy books and books are bought by people.
There are other situations where we have
exactly the same structure.
Documents contain words, and words are
contained in documents.
Similarly, our baby experiences animals. In an experience it may see an animal which has certain features, and certain features are present in different animals. So, for example, four legs are present in many animals, two legs in other sets of animals, et cetera.
In general, we have observations and perceptions. Our observations group perceptions together: each observation consists of many perceptions, and each perception occurs in many observations.
So the question we're trying to ask now is whether we can do something more with this structure. Techniques such as collaborative filtering, latent semantic models, and others are actually able to extract the hidden structure between different elements of such transactions, and essentially extract classes and features from such a graph.
So in this case, for example, books are
being characterized not by the words that
they contain but by the people who bought
them.
And people, in turn, are being characterized by the books that they've bought.
So for people, books are the features; for
books, people are the features.
Similarly, for documents, words are the
features; for words, documents are the
features.
Let's go on to see how one might model this bipartite problem, and unearth the hidden structure, where classes and features are in some sense interchangeable.
There are in fact many ways to approach
the business of uncovering latent models
or hidden structure.
We'll talk about one particular way, which is based on a matrix formulation of the problem; there are many other ways, which we'll mention shortly.
But this way talks about matrices. If you remember, a matrix is an extension of a vector: it's a vector of vectors. So we can think of an m-by-n matrix as having n column vectors, each of size m, or m row vectors, each of size n. And the entries are just numbers.
In our case, we will have a situation
where the numbers will be just 1s and 0s.
For example, a column here could represent a document, where each entry is a one or a zero depending on which of the m possible words in the universe occur in that document. And so every document becomes a column. Similarly, a row could represent a word, with a one or a zero depending on which documents that word actually occurs in.
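To make this concrete, here is a minimal sketch in Python; the vocabulary and documents are invented purely for illustration.

```python
import numpy as np

# Tiny made-up vocabulary and documents; each column of A is a document,
# each row is a word, and A[i, j] is 1 if word i occurs in document j.
vocabulary = ["cloud", "computing", "virtualization", "poetry", "rain", "verse"]
documents = [
    "cloud computing uses virtualization",   # a technical document
    "computing in the cloud",                # another technical document
    "poetry about rain, verse upon verse",   # a poetry document
]

A = np.zeros((len(vocabulary), len(documents)), dtype=int)
for j, doc in enumerate(documents):
    words_in_doc = set(doc.lower().replace(",", "").split())
    for i, word in enumerate(vocabulary):
        if word in words_in_doc:
            A[i, j] = 1

print(A)   # columns describe documents by their words; rows describe
           # words by the documents they occur in
```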
Our approach to latent models relies on factoring this matrix approximately into smaller matrices. We'll explain shortly how that is done, but the point to note is that the first smaller matrix again has m rows, but only k columns.
And to make the matrix multiplication work, the second matrix has to have k rows and n columns; an m-by-k matrix times a k-by-n matrix yields an m-by-n matrix, which is of the right shape.
Go back to your matrix multiplication if
you've forgotten it, because it will come
in handy very shortly.
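As a quick check in Python, with arbitrary dimensions chosen only to show that the shapes work out:

```python
import numpy as np

m, n, k = 6, 4, 2          # m words, n documents, k latent topics (made up)
X = np.random.rand(m, k)   # m-by-k: words by topics
Y = np.random.rand(k, n)   # k-by-n: topics by documents
A_approx = X @ Y           # (m-by-k) times (k-by-n) gives m-by-n
print(A_approx.shape)      # (6, 4), the same shape as A
```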
Now let's assume that we actually had a
factorization of this matrix.
That is, say this matrix A is approximately XY. What does this actually mean?
Let's take a look at a column, say the last column of this matrix A. In matrix multiplication terms, how does this column arise from those of X and Y? Well, the last column of A arises from the last column of Y: the last column of Y tells us how to combine the columns of X to make this last column. So, for example, if the last column of Y were one, zero, one, zero, one, zero, it would tell us exactly which columns of X to add up in order to make up this last column of A.
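Here is a small numerical illustration, with made-up matrices, of how a column of A = XY is a weighted sum of selected columns of X:

```python
import numpy as np

X = np.array([[1, 0, 2],
              [0, 1, 1],
              [3, 0, 0],
              [0, 2, 1]])        # columns of X play the role of "topics"
y_last = np.array([1, 0, 1])     # last column of Y: which columns of X to add

a_last = X @ y_last              # the last column of A
print(a_last)                    # [3 1 3 1]
print(X[:, 0] + X[:, 2])         # identical: the first and third columns of X
```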
Now let's see what that means in terms of
documents and words.
So if the last column indicates a document, it tells us that this document is made up of a certain set of the columns of X, not all of them, a certain set. You could get this last column of A by adding up certain columns of X. The columns of X are in turn certain combinations of words; in this case some of these entries may be large, and some may be small.
So a set of words, each with a different weight, is in some sense a topic.
A topic about cloud computing may have the words cloud, computing, Amazon, virtualization, and other such words. It might not have anything to do with thunderclouds or lightning or rivers or rain. So those words would have lower weight in that topic, and the former ones would have higher weight.
So we can view this document as a weighted combination of topics, with this particular column telling us the weights, that is, which topics occur in this document and with what weight.
Exactly the same formulation works for people buying books. In this case an entry of A is one if a particular person, identified by a certain row of A, buys a particular book, and a book is identified by the people who bought it.
Again, let's look at what the last column of A looks like. It's a linear combination of the columns of X, and each column of X comprises the buying pattern of a certain set of people; in some sense one can view this linear combination as a linear combination of genres.
Another way of looking at this is to look at it row-wise. One might say that the first row of A comprises a combination of the rows of Y, with this particular row of X telling us what combination of rows of Y we should take to make up the first row of A.
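The same factorization viewed row-wise, again with invented numbers:

```python
import numpy as np

X = np.array([[0.5, 0.0, 0.3],
              [0.0, 1.0, 0.0]])   # rows of X: each person's role weights
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]])      # rows of Y: roles, i.e. buying patterns

first_row_of_A = X[0, :] @ Y      # 0.5 * role 0 + 0.3 * role 2
print(first_row_of_A)             # [0.8 0.3 0.5 0.3]
```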
Coming back to people and books, the particular person represented by the first row of A can be thought of as playing multiple roles, which are the rows of Y. A role could be a poetry reader, a technical reader, a law reader, or someone interested in science fiction. The extent to which each of these roles is present in this person is determined by the weights in the first row of X, which tells us what the roles of this particular person are.
So to summarize, we have generalized the
problem of latent models to one of finding
this factorization.
Because once we have this factorization,
we can think of the X matrix as topics or
genres of books.
And the Y matrix as roles of people.
Now, in order to do such a factorization, the matrix A needs to be written in the form XY, but because X and Y are smaller matrices, this is almost always an approximation.
So we need some way of deciding what kind
of approximation to make.
One way is to minimize some measure of the
difference between A and XY.
The F-norm, or Frobenius norm, just says: take the difference between A and XY, and sum the squares of its elements.
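As a sketch of how this can look in practice: scikit-learn's NMF routine minimizes the Frobenius norm by default, and is just one of many packages that could be used; the matrix below is invented for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small made-up word-by-document 0/1 matrix (6 words, 4 documents).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

k = 2                                   # number of latent topics
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
X = model.fit_transform(A)              # m-by-k, non-negative (words by topics)
Y = model.components_                   # k-by-n, non-negative (topics by documents)

# Frobenius norm of the residual A - XY: square every entry, sum, take the
# square root. This is the quantity the factorization tries to keep small.
print(np.linalg.norm(A - X @ Y, "fro"))
```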
There are other ways of doing this minimization, for example by choosing X to be a particular set of columns of A and Y to be a set of rows of A, and finding out which selection of columns and rows of A one can choose to minimize this difference. We don't necessarily have to find X and Y from scratch; one can constrain them to be columns and rows of A. You'll get a different factorization, but either way it seems to work pretty well in practice.
The constraint is that these entries need
to be non-negative because it doesn't
really make sense to have negative
weights.
So you can't have a person playing +0.5
times science fiction reader and -0.3
times technical reader.
It doesn't make sense.
So positive weights are essential.
And so all the entries are required to be non-negative, and various types of non-negative matrix factorizations are possible.
Some compute X and Y from scratch to minimize this kind of norm. Some choose X and Y from amongst the columns and rows of A. In fact, there are many ways one can come up with a non-negative matrix factorization.
There are packages available that compute
the non-negative matrix factorization in
particular ways.
And we can easily use these algorithms to find topics, find roles, or build recommender systems, such as those recommending books and movies on Amazon and Netflix. In fact, there was a prize for being able to predict Netflix's ratings a few years ago, and matrix techniques ended up winning.
Matrix techniques are not the only way of
finding latent models.
There are other probabilistic ways, such as LDA, or latent Dirichlet allocation. There are other matrix techniques, like the singular value decomposition, but these are things we don't have time to get into in this course.
I felt that the matrix approach makes things extremely clear, and also allows one to use a pre-written package to easily find the factorization and therefore determine topics from a bunch of documents.
All one needs to do is formulate the
matrix A and let the package do the rest.
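For instance, here is an end-to-end sketch along those lines, again with invented documents and with scikit-learn standing in for the package; any non-negative factorization routine would do.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cloud computing and virtualization on amazon",
    "big data analytics in the cloud",
    "poetry of rain and rivers",
    "a verse about lightning and rain",
]

# Formulate the matrix A: rows are words, columns are documents, entries 0/1.
vectorizer = CountVectorizer(binary=True)
A = vectorizer.fit_transform(docs).T            # words by documents
words = vectorizer.get_feature_names_out()

k = 2
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
X = model.fit_transform(A)                      # words by topics
Y = model.components_                           # topics by documents

# The top words in each topic, i.e. the largest entries in each column of X.
for t in range(k):
    top = np.argsort(X[:, t])[::-1][:4]
    print(f"topic {t}:", [words[i] for i in top])
```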