Hi. Often in competitions, we have data like text and images. If we have only them, we can apply an approach specific to this type of data. For example, we can use search engines in order to find similar text; that was the case in the Allen AI Challenge, for example. For images, on the other hand, we can use convolutional neural networks, like in the Data Science Bowl and a whole bunch of other competitions.

But if we have text or images as additional data, we usually must extract different features from them, which can be added as complementary to our main dataframe of samples and features. A very simple example of such a case can be found in the Titanic dataset: it has the column Name, which is more or less like text, and to use it, we first need to derive useful features from it. As another example, we may need to predict whether a pair of online advertisements are duplicates, like slightly different copies of each other, and we could have images from these advertisements as complementary data, as in the Avito Duplicate Ads Detection competition. Or you may be given the task of classifying documents, like in the Tradeshift Text Classification Challenge.

When feature extraction is done, we can treat the extracted features differently. Sometimes we just want to add new features to the existing dataframe. Sometimes we might even want to use the derived features independently and, in the end, do stacking with the base solution. We will go through stacking and learn how to apply it later, in the topic about ensembles, but for now you should know that both ways first require us to somehow extract features from text and images. And this is exactly what we will discuss in this video.

Let's start with feature extraction from text. There are two main ways to do this: the first is to apply bag of words, and the second is to use embeddings like Word2vec. Now we'll talk a bit about each of these methods, and in addition, we will go through the text preprocessing related to them.

Let's start with the first approach, the simplest one: bag of words. Here we create a new column for each unique word from the data, then we simply count the number of occurrences of each word and place this value in the appropriate column.
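As a minimal sketch of this counting step (the example texts are made up, and get_feature_names_out assumes scikit-learn 1.0 or newer):

```python
# Bag of words sketch: one column per unique word, values are occurrence counts.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["very very sunny day", "a sunny day with a cat"]  # made-up example texts

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)        # sparse matrix: rows = texts, columns = unique words

print(vectorizer.get_feature_names_out())       # the vocabulary, one entry per column
print(counts.toarray())                         # raw occurrence counts for each text
```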
After applying this operation to each row, we will have the usual dataframe of samples and features. In sklearn, this can be done with CountVectorizer. We can also post-process the calculated matrix using some predefined methods.

To understand why we need post-processing, let's remember that some models, like kNN, linear regression, and neural networks, depend on the scaling of features. So the main goal of post-processing here is, on one side, to make samples more comparable, and on the other, to boost more important features while decreasing the scale of useless ones.

One way to achieve the first goal, making samples more comparable, is to normalize the sum of values in a row. In this way, we will count not occurrences but frequencies of words. Thus, texts of different sizes will be more comparable. This is the exact purpose of the term frequency transformation.

To achieve the second goal, that is, to boost more important features, we post-process our matrix by normalizing the data column-wise. A good idea is to normalize each feature by the inverse fraction of documents which contain the exact word corresponding to this feature. In this case, features corresponding to frequent words will be scaled down compared to features corresponding to rarer words. We can further improve this idea by taking a logarithm of these normalization coefficients. As a result, this will decrease the significance of widespread words in the dataset and thus improve the feature scaling. This is the purpose of the inverse document frequency transformation.

Term frequency and inverse document frequency transformations are often used together, as in sklearn's TfidfVectorizer. Let's apply the TFiDF transformation to the previous example. First, TF. Nice: occurrences are now switched to frequencies, which means the sum of values in each row is now equal to one. Now, IDF. Great: the data is now normalized column-wise, and, as you can see, the IDF transformation scaled down the appropriate feature.

It's worth mentioning that there are plenty of other variants of TFiDF which may work better depending on the specific data. Another very useful technique is Ngrams.
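A hedged sketch of this TF and iDF post-processing with sklearn's TfidfVectorizer, on made-up texts; norm="l1" is passed explicitly so that row values sum to one as in the TF example above, since the library's default is L2 normalization:

```python
# TFiDF sketch: term frequencies per row, scaled down column-wise by document frequency.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["did you have a dog", "you have a cat"]   # made-up example texts

tfidf = TfidfVectorizer(norm="l1")                 # l1 => values in each row sum to one
features = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())
print(features.toarray())                          # words shared by both texts get smaller weights
```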
The concept of Ngrams is simple: you add not only columns corresponding to single words, but also columns corresponding to sequences of consecutive words. This concept can also be applied to sequences of chars, and in cases with low N, we'll have a column for each possible combination of N chars. As we can see, for N = 1, the number of these columns will be equal to 28. Let's calculate the number of these columns for N = 2. Well, it will be 28 squared. Note that sometimes it can be cheaper to have every possible char Ngram as a feature, instead of having a feature for each unique word from the dataset. Using char Ngrams also helps our model to handle unseen words, for example, rare forms of already used words. In sklearn, CountVectorizer has an appropriate parameter for using Ngrams; it is called ngram_range. To change from word Ngrams to char Ngrams, you may use the parameter named analyzer.

Usually, you may want to preprocess text even before applying bag of words, and sometimes careful text preprocessing can help bag of words drastically. Here, we will discuss such methods as converting text to lowercase, lemmatization, stemming, and the usage of stopwords.

Let's consider a simple example which shows the utility of lowercase. What if we applied bag of words to the sentence "Very, very sunny"? We would get three columns, one for each distinct string, because "Very" with a capital letter is not the same string as "very" without it. So we get multiple columns for the same word, and in the same way, "Sunny" with a capital letter wouldn't match "sunny" without it. So the first preprocessing step we want to do is to apply lowercase to our text. Fortunately, CountVectorizer from sklearn does this by default.

Now, let's move on to lemmatization and stemming. These methods refer to more advanced preprocessing. Let's look at this example: we have two sentences, "I had a car" and "We have cars". We may want to unify the words car and cars, which are basically the same word. The same goes for had and have, and so on. Both stemming and lemmatization may be used to fulfill this purpose, but they achieve it in different ways. Stemming usually refers to a heuristic process that chops off the endings of words and thus unites the derived forms of related words like democracy, democratic, and democratization, producing something like "democr" for each of these words.
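A small sketch of the ngram_range and analyzer parameters mentioned above, again on made-up sentences:

```python
# Word Ngrams vs char Ngrams with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I had a car", "We have cars"]            # made-up example texts

# columns for single words and for pairs of consecutive words
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2))
print(word_ngrams.fit_transform(texts).shape)

# columns for pairs of consecutive chars; lowercasing is applied by default in both cases
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
print(char_ngrams.fit_transform(texts).shape)
```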
Lemmatization, on the other hand, usually means that you want to do this more carefully, using knowledge of vocabulary and morphological analysis of words, returning "democracy" for each of these words.

Let's look at another example that shows the difference between stemming and lemmatization by applying them to the word "saw". While stemming will return only the letter "s", lemmatization will try to return either "see" or "saw", depending on the word's meaning.

The last technique for text preprocessing which we will discuss here is the usage of stopwords. Basically, stopwords are words which do not contain important information for our model. They are either insignificant, like articles or prepositions, or so common that they do not help to solve our task. Most languages have a predefined list of stopwords, which can be found on the Internet or loaded from NLTK, which stands for Natural Language Toolkit, a library for Python. CountVectorizer from sklearn also has a parameter related to stopwords, which is called max_df. max_df is a threshold on how frequently a word may occur in the corpus; words seen more often than this threshold will be removed from the text corpus.

Good, we have just discussed the classical feature extraction pipeline for text. At the beginning, we may want to preprocess our text. To do so, we can apply lowercase, stemming, lemmatization, or remove stopwords. After preprocessing, we can use the bag of words approach to get a matrix where each row represents a text and each column represents a unique word. We can also use the bag of words approach for Ngrams and add new columns for groups of several consecutive words or chars. And in the end, we can post-process this matrix using TFiDF, which often proves to be useful.

Well, now we can add the extracted features to our basic dataframe, or fit an independent model on them to create some tricky features.

That's all for now. In the next video, we will continue to discuss feature extraction. We'll go through two big points: first, we'll talk about the Word2vec approach for texts, and second, we will discuss feature extraction for images.
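To tie the pipeline together, here is one possible end-to-end sketch on made-up texts: stemming with NLTK's PorterStemmer, then bag of words with stopword removal and TFiDF; stop_words="english" and the 0.9 cutoff for max_df are illustrative choices, not values from the lecture:

```python
# Preprocess (lowercase + stem), then bag of words with stopword removal and TFiDF.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I had a car", "We have cars", "We have a sunny day"]   # made-up example texts

stemmer = PorterStemmer()
stemmed = [" ".join(stemmer.stem(word) for word in text.lower().split()) for text in texts]

# stop_words="english" drops a predefined stopword list;
# max_df=0.9 additionally removes words appearing in more than 90% of the documents
tfidf = TfidfVectorizer(stop_words="english", max_df=0.9)
features = tfidf.fit_transform(stemmed)

print(tfidf.get_feature_names_out())   # remaining vocabulary after stemming and stopword removal
print(features.shape)                  # rows = texts, columns = surviving (stemmed) words
```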