1
00:00:00,000 --> 00:00:06,212
On the web, one often encounters the long
tail phenomenon, which has to do with the
2
00:00:06,212 --> 00:00:13,122
fact that data is very high dimensional.
For example, think of the recommender
3
00:00:13,122 --> 00:00:17,680
systems that one encounters while browsing
books on Amazon.
4
00:00:18,720 --> 00:00:23,900
The system tells us which books are
frequently bought together.
5
00:00:24,320 --> 00:00:31,610
It also tells us that the customers who
bought the book that we are browsing also
6
00:00:31,610 --> 00:00:34,100
bought some other books.
Now.
7
00:00:35,900 --> 00:00:41,521
The important thing to note is that no
particular set of books on the web, which
8
00:00:41,521 --> 00:00:45,175
has so many different books, has
high support.
9
00:00:45,175 --> 00:00:50,727
In fact in the limit, for any combination
of books the support is close to zero.
10
00:00:50,727 --> 00:00:55,997
There are just so many books that no
particular combination will have high
11
00:00:55,997 --> 00:01:01,950
support.
So, association rule mining simply can't
12
00:01:01,950 --> 00:01:11,157
tell us which books are bought together.
Further, consider things like "customers who
13
00:01:11,157 --> 00:01:16,020
bought this item also bought other items."
Well,
14
00:01:16,400 --> 00:01:23,964
customers need to be compared,
and people have varied interests.
15
00:01:23,964 --> 00:01:31,064
So, how do we choose amongst the books
that the customers who have bought this
16
00:01:31,064 --> 00:01:37,161
book have also bought?
We need some way of choosing from the
17
00:01:37,161 --> 00:01:42,763
many choices that each of those customers
has made. So what's actually going on
18
00:01:42,763 --> 00:01:46,055
here?
It's not traditional rule mining, it's not
19
00:01:46,055 --> 00:01:51,867
traditional clustering; one is not really
looking at books that are similar to each
20
00:01:51,867 --> 00:01:56,280
other in terms of their content or the
words that they contain.
21
00:01:56,940 --> 00:02:01,759
Yet,
one is looking at books which are similar
22
00:02:01,759 --> 00:02:06,679
because of some other reason.
They all happen to deal with the same
23
00:02:06,679 --> 00:02:14,565
subject here, merely because the people who
bought them were similar and they bought
24
00:02:14,565 --> 00:02:18,771
similar books.
At the same time, we've filtered out the
25
00:02:18,771 --> 00:02:21,062
fact that the people have varied
interests.
26
00:02:21,062 --> 00:02:25,271
So, somebody interested in cloud computing
might also be interested in big data,
27
00:02:25,271 --> 00:02:29,267
might also be interested in poetry.
But we don't see poetry books over here.
28
00:02:29,267 --> 00:02:37,447
So what's going on?
Well, as we mentioned a while back, some of
29
00:02:37,447 --> 00:02:43,979
these data sets are actually bipartite, in
the sense that we have transactions where
30
00:02:43,979 --> 00:02:47,080
people buy books and books are bought by
people.
31
00:02:47,940 --> 00:02:53,180
There are other situations where we have
exactly the same structure.
32
00:02:53,700 --> 00:02:58,940
Documents contain words, and words are
contained in documents.
33
00:02:59,520 --> 00:03:06,760
Similarly,
our baby experiences animals.
34
00:03:07,380 --> 00:03:14,752
So, in an experience you may see an animal
which has certain features and certain
35
00:03:14,752 --> 00:03:21,586
features are present in different animals.
So for example, four legs is present in
36
00:03:21,586 --> 00:03:25,889
many animals.
Two legs in other sets of animals, et
37
00:03:25,889 --> 00:03:33,675
cetera.
In general, we have observations and
38
00:03:33,675 --> 00:03:38,120
perceptions.
Our observations are grouped together.
39
00:03:39,100 --> 00:03:44,895
And each observation consists of many
perceptions and each perception occurs in
40
00:03:44,895 --> 00:03:50,822
many observations.
So the question we're trying to ask now is
41
00:03:50,822 --> 00:03:59,397
whether we can do something more with this
structure. Techniques such as
42
00:03:59,397 --> 00:04:06,288
Collaborative Filtering, Latent Semantic
Models, and others are actually able
43
00:04:06,288 --> 00:04:13,180
to extract the hidden structure between
different elements of such transactions
44
00:04:13,440 --> 00:04:21,580
and essentially extract classes and
features from such a graph.
45
00:04:21,920 --> 00:04:28,089
So in this case, for example, books are
being characterized not by the words that
46
00:04:28,089 --> 00:04:31,946
they contain but by the people who bought
them.
47
00:04:31,946 --> 00:04:38,115
And people in turn are being
characterized by the books that they've
48
00:04:38,115 --> 00:04:41,817
bought.
So for people, books are the features; for
49
00:04:41,817 --> 00:04:47,062
books, people are the features.
Similarly, for documents, words are the
50
00:04:47,062 --> 00:04:50,610
features; for words, documents are the
features.
51
00:04:50,610 --> 00:04:57,495
Let's go on to see how one might model
this bipartite problem,
52
00:04:57,495 --> 00:05:04,395
and unearth the hidden structure, where
classes and features are in some sense
53
00:05:04,395 --> 00:05:09,112
interchangeable.
There are in fact many ways to approach
54
00:05:09,112 --> 00:05:13,988
the business of uncovering latent models
or hidden structure.
55
00:05:13,988 --> 00:05:20,543
We'll talk about one particular way which
is based on a matrix formulation of the
56
00:05:20,543 --> 00:05:25,500
problem; there are many other ways, which
we'll mention shortly.
57
00:05:25,900 --> 00:05:34,256
But this way talks about matrices. So, if
you remember, a matrix is an extension of a
58
00:05:34,256 --> 00:05:40,526
vector: it's a vector of vectors.
So, we can think of this M-by-N
59
00:05:40,527 --> 00:05:47,211
matrix as having N column vectors, each of
size M, or M row vectors, each of size N.
60
00:05:47,211 --> 00:05:52,493
And these are just numbers.
In our case, we will have a situation
61
00:05:52,493 --> 00:05:59,342
where the numbers will be just 1s and 0s.
For example, a column here could represent
62
00:05:59,342 --> 00:06:06,009
a document where each entry is a one or a
zero depending on which of the M possible
63
00:06:06,009 --> 00:06:09,560
words in the universe occur in that
document.
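As a minimal sketch (assuming Python with NumPy; the toy documents and vocabulary are invented purely for illustration), such a 0/1 word-by-document matrix A could be built like this:

import numpy as np

# Toy corpus: each string is a "document"; the vocabulary is all M distinct words.
docs = [
    "cloud computing on amazon",
    "big data and cloud computing",
    "rain clouds and rivers",
]
vocab = sorted({w for d in docs for w in d.split()})

# A has M rows (words) and N columns (documents);
# entry (i, j) is 1 if word i occurs in document j, else 0.
A = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in set(d.split()):
        A[vocab.index(w), j] = 1

print(vocab)
print(A)  # each column is a document, each row is a word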
64
00:06:10,080 --> 00:06:18,589
And so every document becomes a column.
Similarly, a row could represent a word
65
00:06:18,589 --> 00:06:25,201
with a one or a zero, based
on which documents that word actually
66
00:06:25,201 --> 00:06:31,901
occurs in.
Our approach to latent models relies on
67
00:06:31,901 --> 00:06:36,720
factoring this matrix approximately into
smaller matrices.
68
00:06:37,640 --> 00:06:43,705
I'll explain shortly how that is done,
but the point to note is that the smaller
69
00:06:43,705 --> 00:06:49,540
matrix is again of size M in terms of
number of rows but has only K columns.
70
00:06:49,880 --> 00:06:57,875
And to make the matrix multiplication work
we have to have this second matrix which
71
00:06:57,875 --> 00:07:06,781
has K rows and N columns; so an M-by-K matrix
times a K-by-N matrix yields an M-by-N matrix
72
00:07:06,781 --> 00:07:13,590
which is of the right structure.
Go back to your matrix multiplication if
73
00:07:13,590 --> 00:07:17,480
you've forgotten it, because it will come
in handy very shortly.
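As a minimal sketch of the shapes involved (assuming Python with NumPy; the sizes M, N, K below are invented for illustration), an M-by-K matrix times a K-by-N matrix does give back an M-by-N matrix:

import numpy as np

M, N, K = 6, 4, 2          # illustrative sizes only
X = np.random.rand(M, K)   # M-by-K: e.g. words (or people) versus K latent topics (or roles)
Y = np.random.rand(K, N)   # K-by-N: K latent topics (or roles) versus documents (or books)
A_approx = X @ Y           # matrix product: an M-by-K times a K-by-N is M-by-N
print(A_approx.shape)      # (6, 4)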
74
00:07:18,360 --> 00:07:23,528
Now let's assume that we actually had a
factorization of this matrix.
75
00:07:23,528 --> 00:07:29,520
That is, this matrix, say A, is approximately
XY. What does this actually mean?
76
00:07:29,880 --> 00:07:34,904
Let's take a look at a column,
say the last column of this
77
00:07:34,904 --> 00:07:40,311
matrix A.
In matrix multiplication terms, how does
78
00:07:40,311 --> 00:07:48,535
this column arise from those of X and Y?
Well, the last column of A arises from the
79
00:07:48,535 --> 00:07:53,777
last column of Y.
And the last column of Y tells us which
80
00:07:53,777 --> 00:07:58,641
combination,
or how to combine, the columns of X to make
81
00:07:58,641 --> 00:08:03,588
this last column.
So, for example, if you had one, zero, one,
82
00:08:03,588 --> 00:08:10,846
zero, one, zero, this would tell us exactly
which columns of X to add up in order to
83
00:08:10,846 --> 00:08:19,251
make up this last column of A.
Now let's see what that means in terms of
84
00:08:19,251 --> 00:08:23,703
documents and words.
So if you had the last column indicating a
85
00:08:23,703 --> 00:08:32,624
document, it tells us that this document is
made up of a certain set of columns of X,
86
00:08:32,624 --> 00:08:40,386
not all of them, a certain set.
So you could get this column of A,
87
00:08:40,386 --> 00:08:44,960
the last one, by adding up certain columns
of X.
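A minimal sketch of this mechanics (assuming Python with NumPy; the numbers are invented): the last column of A is X times the last column of Y, which simply adds up the selected columns of X.

import numpy as np

X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])      # columns of X: three building blocks over four words
y_last = np.array([1, 0, 1])   # last column of Y: which columns of X to add up

# Last column of A = X @ y_last = X[:, 0] + X[:, 2]
a_last = X @ y_last
print(a_last)                  # [2 0 1 1]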
88
00:08:46,000 --> 00:08:51,265
The columns of X are again certain
combinations of words.
89
00:08:51,265 --> 00:08:58,563
In this case some of these may have large
entries, some of these may have small
90
00:08:58,563 --> 00:08:59,580
entries.
So,
91
00:09:00,420 --> 00:09:07,657
a set of words, each with a different weight,
is in some sense a topic.
92
00:09:07,657 --> 00:09:14,180
A topic about cloud computing may have the
words cloud, computing, Amazon,
93
00:09:14,180 --> 00:09:20,980
virtualization and other such words.
It might not have anything to do with
94
00:09:21,960 --> 00:09:27,640
thunderclouds or lightning or rivers or
rain.
95
00:09:28,700 --> 00:09:35,554
So those words would have lower weight
in that topic and the former ones would
96
00:09:35,554 --> 00:09:40,933
have higher weight.
So we can view this document as a weighted
97
00:09:40,933 --> 00:09:47,961
combination of topics, with this particular
column telling us the weights, that is,
98
00:09:47,961 --> 00:09:52,560
which topic occurs in this document with
what weight.
99
00:09:53,140 --> 00:10:01,402
Exactly the same formulation works for
people buying books. So in this case an
100
00:10:01,402 --> 00:10:09,251
entry of A is one if this particular
person, identified by a certain row of A,
101
00:10:09,251 --> 00:10:17,100
buys a particular book, and a book is
identified by the people that bought it.
102
00:10:19,140 --> 00:10:25,246
Again, let's look at what the last column
of A looks like.
103
00:10:25,246 --> 00:10:30,173
It's a linear combination of the columns
of X.
104
00:10:30,173 --> 00:10:36,280
And each column of X comprises the buying
pattern of a certain set of people.
105
00:10:37,740 --> 00:10:43,725
And in some
sense one can view this linear combination
106
00:10:43,725 --> 00:10:50,823
as a linear combination of genres.
Another way of looking at this is to look
107
00:10:50,823 --> 00:10:56,134
at it row-wise.
So one might say that the first row of A
108
00:10:56,134 --> 00:11:00,308
is a combination of the rows of
Y,
109
00:11:00,308 --> 00:11:07,990
with this particular row of X telling us
what combination of rows of Y we should
110
00:11:07,990 --> 00:11:14,440
take to make up this first row of A.
Coming back to people and books,
111
00:11:14,960 --> 00:11:21,017
This particular person, represented by the
first row of A, can be thought of as
112
00:11:21,017 --> 00:11:27,462
playing multiple roles, which are the rows
of Y, each to a certain degree.
113
00:11:27,462 --> 00:11:31,500
This person could be a poetry reader,
a technical reader,
114
00:11:31,500 --> 00:11:36,703
a law reader, or someone
115
00:11:36,703 --> 00:11:43,554
interested in science fiction.
The extent to which each of these roles is present
116
00:11:43,554 --> 00:11:49,951
in this person is determined by the weights
in the first row of X, which tell us
117
00:11:49,951 --> 00:11:53,420
what the roles of this particular person
are.
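A minimal sketch of this row-wise view (assuming Python with NumPy; the weights and roles are invented): the first row of A is the first row of X times Y, i.e. a weighted combination of the rows of Y.

import numpy as np

x_first = np.array([0.7, 0.3, 0.0])   # first row of X: this person's mix of K = 3 roles
Y = np.array([[1, 1, 0, 0],           # role 1: tends to buy books 1 and 2
              [0, 0, 1, 1],           # role 2: tends to buy books 3 and 4
              [1, 0, 0, 1]])          # role 3: tends to buy books 1 and 4

# First row of A = weighted combination of the rows of Y
a_first = x_first @ Y
print(a_first)                        # [0.7 0.7 0.3 0.3]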
118
00:11:55,240 --> 00:12:03,385
So to summarize, we have generalized the
problem of latent models to one of finding
119
00:12:03,385 --> 00:12:09,175
this factorization.
Because once we have this factorization,
120
00:12:09,175 --> 00:12:14,867
we can think of the X matrix as topics or
genres of books.
121
00:12:14,867 --> 00:12:23,381
And the Y matrix as roles of people.
Now in order to do such a factorization,
122
00:12:23,381 --> 00:12:30,733
the matrix A needs to be written in the
form XY but because these are smaller
123
00:12:30,733 --> 00:12:34,742
matrices, this is almost always an
approximation.
124
00:12:34,742 --> 00:12:40,340
So we need some way of deciding what kind
of approximation to make.
125
00:12:41,160 --> 00:12:49,221
One way is to minimize some measure of the
difference between A and XY.
126
00:12:49,222 --> 00:12:55,654
The F norm, or Frobenius norm, just
says: find the difference between A and
127
00:12:55,654 --> 00:13:00,360
XY, and take the sum of the squares
of its elements.
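A minimal sketch of the quantity being minimized (assuming Python with NumPy):

import numpy as np

def frobenius_error(A, X, Y):
    # Sum of squared entries of A - XY, i.e. the squared Frobenius norm.
    D = A - X @ Y
    return np.sum(D ** 2)   # equivalently np.linalg.norm(D, "fro") ** 2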
128
00:13:01,100 --> 00:13:09,283
There are other ways of doing this
minimization, for example by choosing X to
129
00:13:09,283 --> 00:13:16,523
be a particular set of columns of A
and Y to be a particular set of
130
00:13:16,523 --> 00:13:23,185
rows of A, and finding out which selection
of columns of A and rows of A one can
131
00:13:23,185 --> 00:13:30,012
choose to then minimize this difference.
We don't have to necessarily find X and Y
132
00:13:30,012 --> 00:13:34,592
from scratch; one can constrain them to be
columns and rows of A.
133
00:13:34,592 --> 00:13:41,170
You'll get a different factorization, but
either way it seems to work pretty well in
134
00:13:41,170 --> 00:13:46,576
practice.
The constraint is that these entries need
135
00:13:46,576 --> 00:13:52,443
to be non-negative because it doesn't
really make sense to have negative
136
00:13:52,443 --> 00:13:56,221
weights.
So you can't have a person playing +0.5
137
00:13:56,221 --> 00:14:01,043
times science fiction reader and -0.3
times technical reader.
138
00:14:01,043 --> 00:14:05,544
It doesn't make sense.
So positive weights are essential.
139
00:14:05,584 --> 00:14:11,412
And so all the entries are supposed to
be non-negative, so various types of
140
00:14:11,412 --> 00:14:15,270
non-negative matrix factorizations are
possible.
141
00:14:15,270 --> 00:14:20,204
Some compute X and Y from scratch to
minimize this kind of norm.
142
00:14:20,204 --> 00:14:25,960
Some choose X and Y from amongst the
columns of A. And in fact, there are many
143
00:14:25,960 --> 00:14:30,820
ways that one can come up with a
non-negative matrix factorization.
144
00:14:31,860 --> 00:14:39,717
There are packages available that compute
the non-negative matrix factorization in
145
00:14:39,717 --> 00:14:45,208
particular ways.
And we can easily use these algorithms to
146
00:14:45,208 --> 00:14:51,457
find topics, or find roles, or build
recommendation systems, such as those
147
00:14:51,457 --> 00:14:56,380
recommending books and movies on Amazon
and Netflix.
148
00:14:57,100 --> 00:15:04,481
In fact, there was a prize for being able
to predict Netflix's ratings a few years
149
00:15:04,481 --> 00:15:09,280
ago and matrix techniques ended up being
the winner.
150
00:15:12,200 --> 00:15:18,342
Matrix techniques are not the only way of
finding latent models.
151
00:15:18,342 --> 00:15:24,900
There are other probabilistic ways such as
LDA, or latent Dirichlet allocation.
152
00:15:24,900 --> 00:15:31,360
There are other matrix techniques like SVD,
or singular value decomposition, but these are
153
00:15:31,360 --> 00:15:35,898
things we don't have time to get into in
this course.
154
00:15:35,898 --> 00:15:41,128
I felt that the matrix approach makes
things extremely clear and also allows one
155
00:15:41,128 --> 00:15:46,973
to use a pre-written package to easily
find the factorization and therefore
156
00:15:46,973 --> 00:15:50,280
determine topics from a bunch of
documents.
157
00:15:50,280 --> 00:15:56,280
All one needs to do is formulate the
matrix A and let the package do the rest.
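For instance, here is a minimal sketch using scikit-learn's NMF as one such pre-written package (the data is random and the parameter choices are illustrative, not prescribed by the lecture):

import numpy as np
from sklearn.decomposition import NMF

# A: words-by-documents 0/1 matrix, formulated as above (random toy data here).
rng = np.random.default_rng(0)
A = (rng.random((100, 20)) > 0.8).astype(float)   # 100 words, 20 documents

k = 5                                             # number of topics, chosen by hand
model = NMF(n_components=k, init="nndsvd", max_iter=500)
X = model.fit_transform(A)                        # words-by-topics
Y = model.components_                             # topics-by-documents

# The heaviest words in the first topic:
print(np.argsort(X[:, 0])[::-1][:10])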