On the web, one often encounters the long-tail phenomenon, which has to do with the fact that the data is very high dimensional. For example, think of the recommender systems one encounters while browsing books on Amazon. The system tells us which books are frequently bought together. It also tells us that the customers who bought the book we are browsing also bought some other books.

The important thing to note is that on the web, where there are so many different books, no particular set of books has high support. In fact, in the limit, the support of any combination of books is close to zero: there are just so many books that no particular combination will have high support. So association rule mining simply can't tell us which books are bought together.

Further, to produce suggestions like "customers who bought this item also bought these others", customers need to be compared with one another, and people have varied interests. So how do we choose amongst the books that the customers who bought this book have also bought? We need some way of choosing from the many choices each of those customers has made. So what's actually going on here? It's not traditional rule mining, and it's not traditional clustering: one is not really looking at books that are similar to each other in terms of their content, or the words they contain. Yet one is looking at books which are similar for some other reason. They all happen to deal with the same subject here merely because the people who bought them were similar, and those people bought similar books. At the same time, we've filtered out the fact that people have varied interests: somebody interested in cloud computing might also be interested in big data, and might also be interested in poetry, but we don't see poetry books over here. So what's going on?

Well, as we mentioned long back, some of these data sets are actually bipartite, in the sense that we have transactions where people buy books and books are bought by people. There are other situations with exactly the same structure: documents contain words, and words are contained in documents. Similarly, our baby experiences animals: in an experience you may see an animal which has certain features, and certain features are present in different animals. For example, four legs is a feature present in many animals,
two legs in another set of animals, and so on.

In general, we have observations and perceptions: our observations are grouped together, each observation consists of many perceptions, and each perception occurs in many observations. So the question we're asking now is whether we can do something more with this structure. Techniques such as collaborative filtering, latent semantic models, and others are able to extract the hidden structure between the different elements of such transactions, and essentially extract classes and features from such a graph. In this case, for example, books are being characterized not by the words they contain but by the people who bought them, and people in turn are being characterized by the books they've bought. So for people, books are the features; for books, people are the features. Similarly, for documents, words are the features; for words, documents are the features.

Let's go on to see how one might model this bipartite problem and unearth the hidden structure, where classes and features are in some sense interchangeable. There are in fact many ways to approach the business of uncovering latent models, or hidden structure. We'll talk about one particular way, which is based on a matrix formulation of the problem; there are many other ways, which we'll mention shortly.

So, if you remember, a matrix is an extension of a vector: it's a vector of vectors. We can think of an M × N matrix as having N column vectors, each of size M, or M row vectors, each of size N, and the entries are just numbers. In our case, the numbers will be just 1s and 0s. For example, a column could represent a document, where each entry is a one or a zero depending on which of the M possible words in the universe occur in that document. So every document becomes a column. Similarly, a row could represent a word, with a one or a zero depending on which documents that word actually occurs in. Our approach to latent models relies on factoring this matrix, approximately, into smaller matrices.
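To make the matrix concrete before we factor it, here is a minimal sketch in Python with numpy; the tiny vocabulary and the three toy documents are made up purely for illustration and are not data from the lecture.

```python
import numpy as np

# Hypothetical toy example: M = 6 words (rows), N = 3 documents (columns).
vocabulary = ["cloud", "computing", "virtualization", "rain", "lightning", "river"]
documents = [
    "cloud computing virtualization",   # a cloud-computing document
    "rain lightning cloud",             # a weather document
    "river rain",                       # another weather document
]

# A[i, j] = 1 if word i occurs in document j, else 0.
A = np.zeros((len(vocabulary), len(documents)), dtype=int)
for j, doc in enumerate(documents):
    words_in_doc = set(doc.split())
    for i, word in enumerate(vocabulary):
        if word in words_in_doc:
            A[i, j] = 1

print(A)        # each column is a document, each row is a word
print(A.shape)  # (M, N) = (6, 3)
```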
I'll explain shortly how the factoring is done, but the point to note is that the first smaller matrix, X, again has M rows but only K columns. To make the matrix multiplication work, we need a second matrix, Y, with K rows and N columns: an M × K matrix times a K × N matrix yields an M × N matrix, which has the right shape. Go back to your matrix multiplication if you've forgotten it, because it will come in handy very shortly.

Now let's assume we actually had such a factorization, that is, that A is approximately XY. What does this actually mean? Let's take a look at a column, say the last column of the matrix A. In matrix multiplication terms, how does this column arise from those of X and Y? Well, the last column of A arises from the last column of Y: the last column of Y tells us how to combine the columns of X to make this last column. For example, if it were 1, 0, 1, 0, 1, 0, it would tell us exactly which columns of X to add up in order to make up the last column of A.

Now let's see what that means in terms of documents and words. If the last column represents a document, it tells us that this document is made up of a certain set of columns of X, not all of them, a certain set. So you could get this last column of A by adding up certain columns of X. The columns of X are, in turn, certain combinations of words; some entries may be large and some may be small. A set of words, each with a different weight, is in some sense a topic. A topic about cloud computing might have the words cloud, computing, Amazon, virtualization, and other such words. It might not have anything to do with thunderclouds or lightning or rivers or rain; those words would have lower weight in that topic, and the former ones would have higher weight. So we can view a document as a weighted combination of topics, with this particular column of Y telling us the weights, that is, which topics occur in this document and with what weight.

Exactly the same formulation works for people buying books. In this case, an entry of A is one if the person identified by a certain row of A buys a particular book, and a book is identified by the people who bought it. Again, look at what the last column of A is: a linear combination of the columns of X, where each column of X is itself the buying pattern of a certain group of people.
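Here is a minimal sketch of that column view, with small made-up X and Y (the numbers are arbitrary, chosen only to illustrate the algebra): the last column of the product XY is exactly X times the last column of Y, a weighted sum of the columns of X.

```python
import numpy as np

# Made-up example: M = 4 words/people, K = 2 topics/genres, N = 3 documents/books.
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]])          # columns of X: two "topics" (or buying patterns)
Y = np.array([[1, 0, 1],
              [0, 1, 1]])       # columns of Y: topic weights for each document

A_approx = X @ Y                # the reconstructed M x N matrix

# The last column of A_approx is X times the last column of Y,
# i.e. a weighted combination of the columns of X.
last_col = A_approx[:, -1]
combo = X @ Y[:, -1]            # = 1 * X[:, 0] + 1 * X[:, 1] here
print(np.array_equal(last_col, combo))   # True
```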
In some sense, one can view this linear combination as a linear combination of genres.

Another way of looking at this is row-wise. One might say that the first row of A is a combination of the rows of Y, with the corresponding row of X telling us what combination of the rows of Y we should take to make up that first row of A. Coming back to people and books: the person represented by the first row of A can be thought of as playing multiple roles, which are the rows of Y. It could be a poetry reader, a technical reader, a law reader, or someone interested in science fiction. The extent to which each of these roles is present in this person is determined by the weights in the first row of X, which tell us what the roles of this particular person are.

So to summarize, we have generalized the problem of latent models to one of finding this factorization, because once we have the factorization, we can think of the X matrix as topics, or genres of books, and the Y matrix as roles of people. Now, in order to do such a factorization, the matrix A needs to be written in the form XY, but because X and Y are smaller matrices, this is almost always an approximation. So we need some way of deciding what kind of approximation to make. One way is to minimize some measure of the difference between A and XY. The F norm, or Frobenius norm, just says: take the difference between A and XY and sum the squares of its elements.

There are other ways of doing this minimization, for example by choosing X to be a particular set of columns of A and Y to be a particular set of rows of A, and finding which selection of columns and rows of A minimizes this difference. We don't necessarily have to find X and Y from scratch; we can constrain them to be columns and rows of A. You'll get a different factorization, but either way it seems to work pretty well in practice.

The constraint is that the entries need to be non-negative, because it doesn't really make sense to have negative weights: you can't have a person playing +0.5 times science-fiction reader and -0.3 times technical reader. It doesn't make sense, so positive weights are essential.
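For concreteness, here is a minimal sketch of the quantity being minimized, the Frobenius norm of A minus XY, together with the non-negativity check; the matrices are small made-up values consistent with the previous sketch, not real data.

```python
import numpy as np

# Made-up A, X, Y of compatible shapes (M x N, M x K, K x N).
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)
Y = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)

# Frobenius norm of the residual: square root of the sum of squared entries of A - XY.
error = np.linalg.norm(A - X @ Y, ord='fro')
print(error)   # 1.0 here, since X @ Y differs from A only in the bottom-right entry

# Non-negativity constraint: every weight must be >= 0.
assert (X >= 0).all() and (Y >= 0).all()
```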
So all the entries are constrained to be non-negative, and various types of non-negative matrix factorizations are possible. Some compute X and Y from scratch to minimize this kind of norm; some choose X and Y from amongst the columns and rows of A. In fact, there are many ways to come up with a non-negative matrix factorization.

There are packages available that compute a non-negative matrix factorization in particular ways, and we can easily use these algorithms to find topics, or find roles, or build recommender systems, such as those recommending books and movies at Amazon and Netflix. In fact, there was a prize a few years ago for predicting Netflix's ratings, and matrix techniques ended up winning.

Matrix techniques are not the only way of finding latent models. There are probabilistic approaches such as LDA, or Latent Dirichlet Allocation, and there are other matrix techniques, like the singular value decomposition, but these are things we don't have time to get into in this course. I felt that the matrix approach makes things extremely clear, and it also allows one to use a pre-written package to easily find the factorization and therefore determine topics from a bunch of documents. All one needs to do is formulate the matrix A and let the package do the rest.
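As a closing illustration, here is a minimal sketch using one such pre-written package, scikit-learn's NMF, on a small made-up word-by-document matrix; W plays the role of X (topics as weighted sets of words) and H plays the role of Y (topic weights per document). The matrix, the number of topics, and the word labels are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up binary word-by-document matrix: rows = 6 words, columns = 4 documents.
words = ["cloud", "computing", "virtualization", "rain", "lightning", "river"]
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Factor A ~ W H with K = 2 non-negative factors.
model = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
W = model.fit_transform(A)    # M x K: each column is a "topic" (weights over words)
H = model.components_         # K x N: topic weights for each document

# Print the top words in each topic.
for k in range(W.shape[1]):
    top = np.argsort(W[:, k])[::-1][:3]
    print(f"topic {k}:", [words[i] for i in top])
```

Whether the rows of A are words and the columns documents, or the other way around, makes no difference to the algorithm; it only swaps the interpretation of W and H.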