[MUSIC] Hi everyone. In this video I will talk about the application of matrix factorization techniques to feature extraction. You will see a few applications of the approach, and I will show you several examples along with practical details.

Here's a classic example: recommendations. Suppose we have some information about users, like age, region, and interests, and about items, like genre, year, and length. Also, we know the ratings that users gave to some items. These ratings can be organized in a user-item matrix, with rows corresponding to users and columns corresponding to items, as shown in the picture. In the cell with coordinates (i, j) there is the rating that user i gave to item j. Assume that the i-th user has some features u_i, the j-th item has corresponding features m_j, and the scalar product of these features produces the rating r_ij. Now we can apply matrix factorization to learn those features for items and users. Sometimes these features can have an interpretation, like the first item feature being a measure of some property of the item, or something similar. But generally you should consider them as just some new features, which we can use to describe users and items, just as we used other encodings before.

Specifically, our assumption about scalar products is the following: if we stack all the features of users and items into matrices, the matrix product will be very close to the matrix of ratings. In other words, we want to find matrices U and M such that their product approximates the matrix R. That is why this approach is called matrix factorization, or matrix decomposition.

In the previous example we used both row-related and column-related features. But sometimes we don't need features for both rows and columns. Let's consider another example. Suppose that we are classifying texts. Do you remember how we usually classify texts? We extract features, and each document is described by a large sparse vector. If we do matrix factorization over these sparse features, we will get representations for documents, displayed in yellow, and for terms, displayed in green. Although we could somehow use the representations for terms, we are interested only in the representations for documents. Now every document is described by a small, dense vector. These are our new features, and we can use them in a way similar to the previous example. This case is often called dimensionality reduction. It's quite an efficient way to reduce the size of the feature matrix and extract real-valued features from categorical ones.

In competitions we often have different options for preprocessing. For example, using text data, you can run bag-of-words, bag of bigrams, and so on. Using matrix factorization techniques, you are able to extract features from all of these matrices. Since the resulting matrices will be small, we can easily join them and use all these features together in tree-based models.

Now I want to make a few comments about matrix factorization. Note that we are not constrained to reducing the whole matrix: you can apply factorization to a subset of columns and leave the others as is. Besides reduction, you can use matrix factorization for getting another representation of the same data. This is especially useful for ensembles, since it provides diversity of models and leads to better results. Of course, matrix factorization is a lossy transformation; in other words, we will lose some information after such a reduction. The efficiency of this approach heavily depends on the particular task and the chosen number of latent factors. This number should be considered a hyperparameter and needs to be tuned. It's good practice to choose the number of factors between 5 and 100.

Now, let's switch from the general idea to particular implementations. Several matrix factorization methods are implemented in scikit-learn, the most famous being SVD and PCA. In addition, there is TruncatedSVD, which can work with sparse matrices. It's very convenient, for example, in the case of text datasets. Also, there exists a so-called non-negative matrix factorization, or NMF. It imposes an additional restriction: all latent factors must be non-negative, that is, either zero or a positive number. It can be applied only to non-negative matrices, for example a matrix whose cells contain the number of occurrences of each word in each document. NMF has an interesting property: it transforms data in a way that makes it more suitable for decision trees. Take a look at the picture from the Microsoft Malware Classification Challenge. It can be seen that NMF-transformed data forms lines parallel to the axes.

A few more notes on matrix factorizations. Essentially they are very similar to linear models, so we can use the same transformation tricks as we used for linear models. So in addition to standard NMF, I advise you to also apply the factorization to log-transformed data, that is, to log(X + 1). Here's another plot from the competition. It's clear that these two transformations produce different features, and we don't have to choose the best one; instead, it's beneficial to use both of them.

I want to note that matrix factorization is a trainable transformation with its own parameters, so we should be careful and use the same transformation for all parts of the dataset. Fitting and transforming each part individually is wrong, because in that case you will get two different transformations, and this can lead to an error which will be hard to find. The correct method, shown in one of the sketches below, is to first fit the transformation on all the data, and only then apply it to each individual part. The following sketches illustrate the ideas from this video with scikit-learn.
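To make the rating example concrete, here is a minimal sketch of factorizing a user-item matrix with TruncatedSVD from scikit-learn. The toy matrix is made up for illustration, and missing ratings are simply treated as zeros, which is a simplification; dedicated recommender methods handle missing entries more carefully.

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy rating matrix R: rows are users, columns are items.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

svd = TruncatedSVD(n_components=2)  # number of latent factors
U = svd.fit_transform(R)            # user features, shape (4, 2)
M = svd.components_                 # item features, shape (2, 4)

# The product of the factor matrices approximates the original ratings.
print(np.round(U @ M, 1))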
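For the text example, a sketch of turning large sparse bag-of-words vectors into small dense document features might look like this; the three-document corpus is, of course, hypothetical.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "dogs and cats make good pets"]

X_sparse = CountVectorizer().fit_transform(docs)  # large sparse matrix
# TruncatedSVD works directly on sparse input, which is why it is
# convenient for text datasets.
X_dense = TruncatedSVD(n_components=2).fit_transform(X_sparse)
print(X_dense.shape)  # (3, 2): one small dense vector per document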
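Here is a sketch of the joining trick: factorize several alternative preprocessings of the same texts and stack the small results side by side for a tree-based model. This particular choice of vectorizers is just an example.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "dogs and cats make good pets"]

vectorizers = [CountVectorizer(),                    # bag of words
               TfidfVectorizer(),                    # TF-IDF
               CountVectorizer(ngram_range=(2, 2))]  # bag of bigrams

parts = [TruncatedSVD(n_components=2).fit_transform(v.fit_transform(docs))
         for v in vectorizers]

X_joined = np.hstack(parts)  # small and dense: easy to feed to trees
print(X_joined.shape)        # (3, 6)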
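Next, a sketch of NMF, including the log-transform trick, on the same kind of non-negative word-count matrix (again with a made-up corpus).

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "dogs and cats make good pets"]

# Word counts are non-negative, so NMF is applicable.
X = CountVectorizer().fit_transform(docs)

W_raw = NMF(n_components=2).fit_transform(X)          # standard NMF
W_log = NMF(n_components=2).fit_transform(X.log1p())  # NMF of log(X + 1)

# The two transformations produce different features; both can be used.
print(W_raw.shape, W_log.shape)  # (3, 2) (3, 2)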
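Finally, a sketch of the correct fit-then-transform pattern, here with PCA; the random arrays stand in for real train and test parts.

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(100, 20)  # stand-ins for real data parts
X_test = np.random.rand(50, 20)

# Fit one transformation on all the data...
pca = PCA(n_components=5)
pca.fit(np.concatenate([X_train, X_test]))

# ...and only then apply it to each part. Fitting separately on each
# part would produce two different, incompatible transformations.
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)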
To sum up, matrix decomposition is a very general approach to dimensionality reduction and feature extraction. It can be used to transform categorical features into real-valued ones, and the tricks that work for linear models are also suitable for matrix factorizations. Thank you for your attention. [MUSIC] [SOUND]