On the web, one often encounters the long-tail phenomenon, which has to do with the fact that the data is very high dimensional. For example, think of the recommender systems one encounters while browsing books on Amazon. The system tells us which books are frequently bought together. It also tells us that the customers who bought the book we are browsing also bought some other books.

The important thing to note is that on the web, where there are so many different books, no particular set of books has high support. In fact, in the limit, the support of any combination of books is close to zero: there are just so many books that no particular combination will have high support. So association rule mining simply can't tell us which books are bought together.

Further, to produce suggestions like "customers who bought this item also bought these others", customers need to be compared with one another, and people have varied interests. So how do we choose amongst the books that the customers who bought this book have also bought? We need some way of choosing from the many choices each of those customers has made. So what's actually going on here? It's not traditional rule mining, and it's not traditional clustering: one is not really looking at books that are similar to each other in terms of their content, or the words they contain. Yet one is looking at books which are similar for some other reason. They all happen to deal with the same subject here merely because the people who bought them were similar, and those people bought similar books. At the same time, we've filtered out the fact that people have varied interests: somebody interested in cloud computing might also be interested in big data, and might also be interested in poetry, but we don't see poetry books over here. So what's going on?

Well, as we mentioned long back, some of these data sets are actually bipartite, in the sense that we have transactions where people buy books and books are bought by people. There are other situations with exactly the same structure: documents contain words, and words are contained in documents. Similarly, our baby experiences animals: in an experience you may see an animal which has certain features, and certain features are present in different animals. For example, four legs is a feature present in many animals,
two legs in another set of animals, and so on.

In general, we have observations and perceptions: our observations are grouped together, each observation consists of many perceptions, and each perception occurs in many observations. So the question we're asking now is whether we can do something more with this structure. Techniques such as collaborative filtering, latent semantic models, and others are able to extract the hidden structure between the different elements of such transactions, and essentially extract classes and features from such a graph. In this case, for example, books are being characterized not by the words they contain but by the people who bought them, and people in turn are being characterized by the books they've bought. So for people, books are the features; for books, people are the features. Similarly, for documents, words are the features; for words, documents are the features.

Let's go on to see how one might model this bipartite problem and unearth the hidden structure, where classes and features are in some sense interchangeable. There are in fact many ways to approach the business of uncovering latent models, or hidden structure. We'll talk about one particular way, which is based on a matrix formulation of the problem; there are many other ways, which we'll mention shortly.

So, if you remember, a matrix is an extension of a vector: it's a vector of vectors. We can think of an M × N matrix as having N column vectors, each of size M, or M row vectors, each of size N, and the entries are just numbers. In our case, the numbers will be just 1s and 0s. For example, a column could represent a document, where each entry is a one or a zero depending on which of the M possible words in the universe occur in that document. So every document becomes a column. Similarly, a row could represent a word, with a one or a zero depending on which documents that word actually occurs in. Our approach to latent models relies on factoring this matrix, approximately, into smaller matrices.
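To make the matrix concrete before we factor it, here is a minimal sketch in Python with numpy; the tiny vocabulary and the three toy documents are made up purely for illustration and are not data from the lecture.

```python
import numpy as np

# Hypothetical toy example: M = 6 words (rows), N = 3 documents (columns).
vocabulary = ["cloud", "computing", "virtualization", "rain", "lightning", "river"]
documents = [
    "cloud computing virtualization",   # a cloud-computing document
    "rain lightning cloud",             # a weather document
    "river rain",                       # another weather document
]

# A[i, j] = 1 if word i occurs in document j, else 0.
A = np.zeros((len(vocabulary), len(documents)), dtype=int)
for j, doc in enumerate(documents):
    words_in_doc = set(doc.split())
    for i, word in enumerate(vocabulary):
        if word in words_in_doc:
            A[i, j] = 1

print(A)        # each column is a document, each row is a word
print(A.shape)  # (M, N) = (6, 3)
```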
I'll explain shortly how the factoring is done, but the point to note is that the first smaller matrix, X, again has M rows but only K columns. To make the matrix multiplication work, we need a second matrix, Y, with K rows and N columns: an M × K matrix times a K × N matrix yields an M × N matrix, which has the right shape. Go back to your matrix multiplication if you've forgotten it, because it will come in handy very shortly.

Now let's assume we actually had such a factorization, that is, that A is approximately XY. What does this actually mean? Let's take a look at a column, say the last column of the matrix A. In matrix multiplication terms, how does this column arise from those of X and Y? Well, the last column of A arises from the last column of Y: the last column of Y tells us how to combine the columns of X to make this last column. For example, if it were 1, 0, 1, 0, 1, 0, it would tell us exactly which columns of X to add up in order to make up the last column of A.

Now let's see what that means in terms of documents and words. If the last column represents a document, it tells us that this document is made up of a certain set of columns of X, not all of them, a certain set. So you could get this last column of A by adding up certain columns of X. The columns of X are, in turn, certain combinations of words; some entries may be large and some may be small. A set of words, each with a different weight, is in some sense a topic. A topic about cloud computing might have the words cloud, computing, Amazon, virtualization, and other such words. It might not have anything to do with thunderclouds or lightning or rivers or rain; those words would have lower weight in that topic, and the former ones would have higher weight. So we can view a document as a weighted combination of topics, with this particular column of Y telling us the weights, that is, which topics occur in this document and with what weight.

Exactly the same formulation works for people buying books. In this case, an entry of A is one if the person identified by a certain row of A buys a particular book, and a book is identified by the people who bought it. Again, look at what the last column of A is: a linear combination of the columns of X, where each column of X is itself the buying pattern of a certain group of people.
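Here is a minimal sketch of that column view, with small made-up X and Y (the numbers are arbitrary, chosen only to illustrate the algebra): the last column of the product XY is exactly X times the last column of Y, a weighted sum of the columns of X.

```python
import numpy as np

# Made-up example: M = 4 words/people, K = 2 topics/genres, N = 3 documents/books.
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]])          # columns of X: two "topics" (or buying patterns)
Y = np.array([[1, 0, 1],
              [0, 1, 1]])       # columns of Y: topic weights for each document

A_approx = X @ Y                # the reconstructed M x N matrix

# The last column of A_approx is X times the last column of Y,
# i.e. a weighted combination of the columns of X.
last_col = A_approx[:, -1]
combo = X @ Y[:, -1]            # = 1 * X[:, 0] + 1 * X[:, 1] here
print(np.array_equal(last_col, combo))   # True
```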
In some sense, one can view this linear combination as a linear combination of genres.

Another way of looking at this is row-wise. One might say that the first row of A is a combination of the rows of Y, with the corresponding row of X telling us what combination of the rows of Y we should take to make up that first row of A. Coming back to people and books: the person represented by the first row of A can be thought of as playing multiple roles, which are the rows of Y. It could be a poetry reader, a technical reader, a law reader, or someone interested in science fiction. The extent to which each of these roles is present in this person is determined by the weights in the first row of X, which tell us what the roles of this particular person are.

So to summarize, we have generalized the problem of latent models to one of finding this factorization, because once we have the factorization, we can think of the X matrix as topics, or genres of books, and the Y matrix as roles of people. Now, in order to do such a factorization, the matrix A needs to be written in the form XY, but because X and Y are smaller matrices, this is almost always an approximation. So we need some way of deciding what kind of approximation to make. One way is to minimize some measure of the difference between A and XY. The F norm, or Frobenius norm, just says: take the difference between A and XY and sum the squares of its elements.

There are other ways of doing this minimization, for example by choosing X to be a particular set of columns of A and Y to be a particular set of rows of A, and finding which selection of columns and rows of A minimizes this difference. We don't necessarily have to find X and Y from scratch; we can constrain them to be columns and rows of A. You'll get a different factorization, but either way it seems to work pretty well in practice.

The constraint is that the entries need to be non-negative, because it doesn't really make sense to have negative weights: you can't have a person playing +0.5 times science-fiction reader and -0.3 times technical reader. It doesn't make sense, so positive weights are essential.
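For concreteness, here is a minimal sketch of the quantity being minimized, the Frobenius norm of A minus XY, together with the non-negativity check; the matrices are small made-up values consistent with the previous sketch, not real data.

```python
import numpy as np

# Made-up A, X, Y of compatible shapes (M x N, M x K, K x N).
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)
Y = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)

# Frobenius norm of the residual: square root of the sum of squared entries of A - XY.
error = np.linalg.norm(A - X @ Y, ord='fro')
print(error)   # 1.0 here, since X @ Y differs from A only in the bottom-right entry

# Non-negativity constraint: every weight must be >= 0.
assert (X >= 0).all() and (Y >= 0).all()
```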
So all the entries are constrained to be non-negative, and various types of non-negative matrix factorizations are possible. Some compute X and Y from scratch to minimize this kind of norm; some choose X and Y from amongst the columns and rows of A. In fact, there are many ways to come up with a non-negative matrix factorization.

There are packages available that compute a non-negative matrix factorization in particular ways, and we can easily use these algorithms to find topics, or find roles, or build recommender systems, such as those recommending books and movies at Amazon and Netflix. In fact, there was a prize a few years ago for predicting Netflix's ratings, and matrix techniques ended up winning.

Matrix techniques are not the only way of finding latent models. There are probabilistic approaches such as LDA, or Latent Dirichlet Allocation, and there are other matrix techniques, like the singular value decomposition, but these are things we don't have time to get into in this course. I felt that the matrix approach makes things extremely clear, and it also allows one to use a pre-written package to easily find the factorization and therefore determine topics from a bunch of documents. All one needs to do is formulate the matrix A and let the package do the rest.
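As a closing illustration, here is a minimal sketch using one such pre-written package, scikit-learn's NMF, on a small made-up word-by-document matrix; W plays the role of X (topics as weighted sets of words) and H plays the role of Y (topic weights per document). The matrix, the number of topics, and the word labels are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up binary word-by-document matrix: rows = 6 words, columns = 4 documents.
words = ["cloud", "computing", "virtualization", "rain", "lightning", "river"]
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Factor A ~ W H with K = 2 non-negative factors.
model = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
W = model.fit_transform(A)    # M x K: each column is a "topic" (weights over words)
H = model.components_         # K x N: topic weights for each document

# Print the top words in each topic.
for k in range(W.shape[1]):
    top = np.argsort(W[:, k])[::-1][:3]
    print(f"topic {k}:", [words[i] for i in top])
```

Whether the rows of A are words and the columns documents, or the other way around, makes no difference to the algorithm; it only swaps the interpretation of W and H.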