On the web, one often encounters the long-tail phenomenon, which has to do with the fact that data is very high dimensional. For example, think of the recommender systems one encounters while browsing books on Amazon. The system tells us which books are frequently bought together. It also tells us that customers who bought the book we are browsing also bought some other books. Now, the important thing to note is that on the web, which has so very many different books, no particular set of books has high support. In fact, in the limit, the support of any combination of books is close to zero. There are just so many books that no particular combination will have high support. So association rule mining simply can't tell us which books are bought together. Further, consider features like "customers who bought this item also bought other items". Customers need to be compared, and people have varied interests. So how do we choose amongst the books that the customers who bought this book have also bought? We need some way of choosing from the many choices that each of those customers has made.

So what's actually going on here? It's not traditional rule mining, and it's not traditional clustering: one is not really looking at books that are similar to each other in terms of their content or the words they contain. Yet one is looking at books that are similar for some other reason. They all happen to deal with the same subject merely because the people who bought them were similar and bought similar books. At the same time, we've filtered out the fact that people have varied interests. Somebody interested in cloud computing might also be interested in big data, and might also be interested in poetry. But we don't see poetry books over here. So what's going on?

Well, as we mentioned earlier, some of these data sets are actually bipartite, in the sense that we have transactions where people buy books and books are bought by people. There are other situations with exactly the same structure. Documents contain words, and words are contained in documents. Similarly, our baby experiences animals: in an experience you may see an animal that has certain features, and certain features are present in different animals. For example, four legs are present in many animals, two legs in other sets of animals, et cetera. In general, we have observations and perceptions. Our observations are grouped together; each observation consists of many perceptions, and each perception occurs in many observations. So the question we're trying to ask now is whether we can do something more with this structure. Techniques such as collaborative filtering, latent semantic models, and others are actually able to extract the hidden structure between the different elements of such transactions and essentially extract classes and features from such a graph. In this case, for example, books are being characterized not by the words they contain but by the people who bought them, and people, in turn, are being characterized by the books they've bought. So for people, books are the features; for books, people are the features. Similarly, for documents, words are the features; for words, documents are the features. Let's go on to see how one might model this bipartite problem and unearth the hidden structure, where classes and features are in some sense interchangeable.
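To make the bipartite structure concrete, here is a minimal sketch, with made-up people and book titles, of how such person-versus-book transactions become a 0/1 matrix; the same construction works for documents versus words.

```python
import numpy as np

# Hypothetical transactions: each person is listed with the books they bought.
# Names and titles are invented purely for illustration.
transactions = {
    "alice": ["Cloud Computing", "Big Data", "Virtualization"],
    "bob":   ["Cloud Computing", "Selected Poems"],
    "carol": ["Big Data", "Virtualization"],
}

people = sorted(transactions)
books = sorted({b for bought in transactions.values() for b in bought})

# Bipartite incidence matrix A: A[i, j] = 1 if person i bought book j.
A = np.zeros((len(people), len(books)), dtype=int)
for i, person in enumerate(people):
    for book in transactions[person]:
        A[i, books.index(book)] = 1

print(people)
print(books)
print(A)
```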
There are, in fact, many ways to approach the business of uncovering latent models or hidden structure. We'll talk about one particular way, which is based on a matrix formulation of the problem; there are many other ways, which we'll mention shortly. But this way talks about matrices. If you remember, a matrix is an extension of a vector: it's a vector of vectors. So we can think of this m×n matrix as having n column vectors, each of size m, or m row vectors, each of size n. And these are just numbers. In our case, the numbers will be just 1s and 0s. For example, a column here could represent a document, where each entry is a one or a zero depending on which of the m possible words in the universe occur in that document. So every document becomes a column. Similarly, a row could represent a word, with a one or a zero depending on which documents that word actually occurs in.

Our approach to latent models relies on factoring this matrix, approximately, into smaller matrices. We'll explain shortly how that is done, but the point to note is that the first smaller matrix again has m rows but only k columns. To make the matrix multiplication work, the second matrix has to have k rows and n columns, so an m×k matrix times a k×n matrix yields an m×n matrix, which has the right shape. Go back to your matrix multiplication if you've forgotten it, because it will come in handy very shortly.

Now let's assume that we actually had a factorization of this matrix; that is, say A is approximately XY. What does this actually mean? Let's take a look at a column, say the last column of this matrix A. In matrix-multiplication terms, how does this column arise from X and Y? Well, the last column of A arises from the last column of Y: the last column of Y tells us how to combine the columns of X to make up this last column of A. So, for example, if the last column of Y were 1, 0, 1, 0, 1, 0, it would tell us exactly which columns of X to add up in order to make this last column of A.

Now let's see what that means in terms of documents and words. If the last column represents a document, this tells us that the document is made up of a certain set of columns of X, not all of them, a certain set. You could get this last column of A by adding up certain columns of X. The columns of X are in turn combinations of words; some of these entries may be large, some small. A set of words, each with a different weight, is in some sense a topic. A topic about cloud computing may have words such as cloud, computing, Amazon, virtualization, and other such words. It might not have anything to do with thunderclouds or lightning or rivers or rain, so those words would have lower weight in that topic, and the former ones would have higher weight. So we can view this document as a weighted combination of topics, with this particular column of Y telling us the weights, that is, which topic occurs in this document and with what weight.

Exactly the same formulation works for people buying books. In this case, an entry of A is one if a particular person, identified by a certain row of A, buys a particular book, and a book is identified by the people who bought it. Again, let's look at what the last column of A looks like: it's a linear combination of the columns of X, and each column of X comprises the buying pattern of a certain set of people. In some sense one can view this linear combination as a linear combination of genres.
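As a sanity check on this column view, here is a minimal numpy sketch, with made-up sizes and random entries, showing that each column of A = XY is exactly the columns of X combined using the weights in the corresponding column of Y.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 6, 4, 2          # m words, n documents, k topics (toy sizes)
X = rng.random((m, k))     # columns of X: topics, i.e. weighted sets of words
Y = rng.random((k, n))     # columns of Y: topic weights for each document

A = X @ Y                  # an m x k matrix times a k x n matrix gives m x n

# The last column of A is the columns of X combined with the weights
# given by the last column of Y.
last_doc = A[:, -1]
reconstructed = X @ Y[:, -1]
print(np.allclose(last_doc, reconstructed))   # True
```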
Another way of looking at this is row-wise. One might say that the first row of A is a combination of the rows of Y, with the first row of X telling us what combination of the rows of Y we should take to make up this first row of A. Coming back to people and books: the person represented by the first row of A can be thought of as playing multiple roles, which are the rows of Y. That person could be a poetry reader, a technical reader, a law reader, or someone interested in science fiction, and the degree to which each of these roles is present in this person is determined by the weights in the first row of X. In other words, the first row of X tells us what the roles of this particular person are.

So, to summarize, we have generalized the problem of finding latent models to one of finding this factorization, because once we have this factorization, we can think of the X matrix as capturing topics, or genres of books, and the Y matrix as capturing the roles of people. Now, in order to do such a factorization, the matrix A needs to be written in the form XY, but because X and Y are smaller matrices, this is almost always an approximation. So we need some way of deciding what kind of approximation to make. One way is to minimize some measure of the difference between A and XY. The F-norm, or Frobenius norm, just says: take the difference between A and XY, and sum the squares of its elements. There are other ways of doing this minimization, for example by choosing X to be a particular set of columns of A and Y to be a particular set of rows of A, and finding out which selection of columns and rows of A minimizes this difference. We don't necessarily have to find X and Y from scratch; one can constrain them to come from A itself. You'll get a different factorization, but either way seems to work pretty well in practice.

A further constraint is that the entries need to be non-negative, because it doesn't really make sense to have negative weights: you can't have a person playing +0.5 times science-fiction reader and -0.3 times technical reader. That doesn't make sense, so non-negative weights are essential. With all the entries required to be non-negative, various types of non-negative matrix factorizations are possible. Some compute X and Y from scratch to minimize this kind of norm; some choose X and Y from amongst the columns and rows of A; and in fact there are many ways one can come up with a non-negative matrix factorization. There are packages available that compute non-negative matrix factorizations in particular ways, and we can easily use these algorithms to find topics, find roles, or build recommendation systems, such as those recommending books and movies on Amazon and Netflix. In fact, there was a prize for predicting Netflix's ratings a few years ago, and matrix techniques ended up winning.

Matrix techniques are not the only way of finding latent models. There are other, probabilistic ways, such as LDA, or latent Dirichlet allocation. There are also other matrix techniques, like the singular value decomposition (SVD), but these are things we don't have time to get into in this course.
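To make the Frobenius-norm idea concrete, here is a minimal sketch of one way to compute such a non-negative factorization from scratch, using the well-known multiplicative update rules on a made-up 0/1 matrix. This is only an illustration of the general technique, not the specific algorithm any particular package uses.

```python
import numpy as np

def nmf(A, k, iters=500, eps=1e-9, seed=0):
    """Approximate a non-negative m x n matrix A as X @ Y, with X (m x k) and
    Y (k x n) both non-negative, by driving down the Frobenius norm of A - XY
    using the classic multiplicative update rules."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, k))
    Y = rng.random((k, n))
    for _ in range(iters):
        # Each update keeps all entries non-negative.
        X *= (A @ Y.T) / (X @ Y @ Y.T + eps)
        Y *= (X.T @ A) / (X.T @ X @ Y + eps)
    return X, Y

# Toy 0/1 word-by-document matrix (invented for illustration).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

X, Y = nmf(A, k=2)
print(np.linalg.norm(A - X @ Y))   # Frobenius norm of the residual A - XY
```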
I felt that the matrix approach makes things extremely clear and also allows one to use a pre-written package to easily find the factorization and therefore determine topics from a bunch of documents. All one needs to do is formulate the matrix A and let the package do the rest.
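As one concrete illustration of "formulate A and let the package do the rest", here is a minimal sketch using scikit-learn on a tiny made-up corpus; the choice of package is an assumption for illustration, not something prescribed here. Note that scikit-learn puts documents in rows rather than columns, which is just the transpose of the convention above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = [  # made-up mini-corpus for illustration
    "cloud computing virtualization amazon cloud",
    "big data cloud computing analytics",
    "rain clouds thunder lightning rivers",
    "rivers rain monsoon lightning",
]

# Formulate the matrix A: one row per document, word counts as entries.
vectorizer = CountVectorizer()
A = vectorizer.fit_transform(docs)

# Factor A into two non-negative matrices and inspect the topics.
model = NMF(n_components=2, init="nndsvd", random_state=0)
doc_topic = model.fit_transform(A)   # weight of each topic in each document
topic_word = model.components_       # weight of each word in each topic

words = vectorizer.get_feature_names_out()
for t, row in enumerate(topic_word):
    top = [words[i] for i in row.argsort()[::-1][:4]]
    print(f"topic {t}: {top}")
```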