Hi everyone. In this video I will talk about the application of matrix factorization techniques to feature extraction. You will see a few applications of the approach and will be able to apply it yourself. I will show you several examples along with practical details.

Here's a classic example: recommendations. Suppose we have some information about users, like age, region, and interests, and about items, like genre, year, and length. We also know the ratings that users gave to some items. These ratings can be organized in a user-item matrix, with rows corresponding to users and columns corresponding to items, as shown in the picture. The cell with coordinates (i, j) contains the rating that user i gave to item j.

Assume that the i-th user has some feature vector Ui, the j-th item has a corresponding feature vector Mj, and the scalar product of these vectors produces the rating Rij. Now we can apply matrix factorization to learn those features for items and users. Sometimes these features have an interpretation: for example, the first item feature may turn out to measure some interpretable property of the item. But generally you should treat them just as extra features, which we can use to encode a user in the same way as we did before with a label encoder or one-hot encoder.

Specifically, our assumption about the scalar product is the following: if we arrange all attributes of users and items as matrices, their matrix product will be very close to the matrix of ratings. In other words, we want to find matrices U and M such that their product approximates the matrix R. That is why this approach is called matrix factorization, or matrix decomposition.

In the previous example we used both row-related and column-related features, but sometimes we only need the features for one of them. Let's consider another example. Suppose we are working with texts. Do you remember how we usually classify texts? We extract features, and each document is described by a large sparse vector. If we do matrix factorization over these sparse features, we get a representation for documents, displayed in yellow, and a representation for terms, displayed in green. Although we could somehow use the representation for terms, here we are interested only in the representation for documents. Now every document is described by a small, dense vector, as the sketch below illustrates.
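Here is a minimal sketch of that idea, assuming a toy corpus and scikit-learn's CountVectorizer and TruncatedSVD; the corpus and the number of components are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus; in practice this would be your text column.
docs = [
    "cats and dogs",
    "dogs chase cats",
    "stock prices fell",
    "stock market prices",
]

# Large sparse document-term matrix: rows = documents, columns = terms.
X_sparse = CountVectorizer().fit_transform(docs)

# Factorize it: each document becomes a small dense vector of latent factors.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_features = svd.fit_transform(X_sparse)   # document representation (the "yellow" matrix)
term_features = svd.components_              # term representation (the "green" matrix)

print(doc_features.shape)  # (4, 2): small, dense features for each document
```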
These dense vectors are our new features, and we can use them in a way similar to the previous example. This case is often called dimensionality reduction. It's quite an efficient way to reduce the size of the feature matrix and to extract real-valued features from categorical ones.

In competitions we often have several options for preprocessing. For example, for text data you can build a bag of words, a bag of bigrams, and so on. Using the matrix factorization technique, you are able to extract features from all of these matrices. Since the resulting matrices are small, we can easily join them and use all of the features together in tree-based models.

Now I want to make a few comments about matrix factorization. Note that we are not constrained to decompose the whole matrix: you can apply factorization to a subset of the columns and leave the others as they are. Besides dimensionality reduction, you can use factorization simply to get another representation of the same data. This is especially useful for ensembling, since it provides diversity among models and leads to better results.

Of course, matrix factorization is a lossy transformation; in other words, we lose some information after such a reduction. The efficiency of this approach heavily depends on the particular task and on the chosen number of latent factors. This number should be considered a hyperparameter and needs to be tuned. It's good practice to choose the number of factors between 5 and 100.

Now let's switch from the general idea to particular implementations. Several matrix factorization methods are implemented in scikit-learn; the most famous are SVD and PCA. In addition, there is TruncatedSVD, which can work with sparse matrices. That is very convenient, for example, in the case of text datasets. There also exists so-called non-negative matrix factorization, or NMF. It imposes an additional restriction: all hidden factors must be non-negative, that is, either zero or a positive number. It can be applied only to non-negative matrices, for example a matrix whose cells contain the number of occurrences of each word in each document. NMF has an interesting property: it transforms data in a way that makes it more suitable for decision trees. Take a look at the picture from the Microsoft Mobile Classification Challenge: you can see that the NMF-transformed data forms lines parallel to the axes.
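As a small sketch of how NMF is applied (the count matrix here is random and the parameter values are arbitrary, just to show that both factor matrices come out non-negative):

```python
import numpy as np
from sklearn.decomposition import NMF

# A non-negative "word count" matrix: rows = documents, columns = words.
rng = np.random.RandomState(0)
counts = rng.poisson(lam=1.0, size=(6, 10)).astype(float)

# NMF restricts both factor matrices to non-negative values.
nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
doc_factors = nmf.fit_transform(counts)   # all entries >= 0
word_factors = nmf.components_            # all entries >= 0

print(doc_factors.min() >= 0, word_factors.min() >= 0)  # True True
```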
A few more notes on matrix factorizations. Essentially, they are very similar to linear models, so we can use the same transformation tricks as we use for linear models. So, in addition to standard NMF, I advise you to also apply the factorization to transformed data, for example to log-transformed counts. Here's another plot from the competition. It's clear that these two transformations produce different features, and we don't have to choose the best one; instead, it's beneficial to use both of them.

I want to note that matrix factorization is a trainable transformation and has its own parameters, so we should be careful and use the same transformation for all parts of the dataset. Fitting and transforming each part individually is wrong, because in that case you will get two different transformations. This can lead to an error that will be hard to find. The correct method is shown below: first we fit the transformation on all of the data, and only then apply it to each individual part; see the code sketch at the end.

To sum up, matrix decomposition is a very general approach to dimensionality reduction and feature extraction. It can be used to transform categorical features into real-valued ones, and the tricks for linear models are also suitable for matrix factorizations. Thank you for your attention.
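Here is a minimal sketch of that fit-on-all-data pattern, assuming placeholder matrices X_train and X_test and TruncatedSVD as the factorization method:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Placeholder sparse feature matrices for the train and test parts.
X_train = sp.random(100, 50, density=0.1, format="csr", random_state=0)
X_test = sp.random(40, 50, density=0.1, format="csr", random_state=1)

# Wrong: calling fit_transform on each part separately gives two different transformations.
# Correct: fit one transformation on all available data, then apply it to each part.
svd = TruncatedSVD(n_components=5, random_state=0)
svd.fit(sp.vstack([X_train, X_test]))

train_features = svd.transform(X_train)
test_features = svd.transform(X_test)
```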