We've just covered yet another field of machine learning: deep unsupervised learning, with autoencoders and all that kind of stuff. To complete the collection, let's now cover another application domain; namely, let's talk about how to apply unsupervised learning to text data. A disclaimer, though: we won't have time to cover text in depth. However, you'll be able to get much more information about it in the appropriate course of our specialization.

What is a text? There are a lot of linguistic articles describing text in much more high-level notions, but for us it is sufficient to say that a text is a sequence of words, or a sequence of characters if you don't want to go that deep. A word, a character, or some other kind of token is just an atomic element of text; basically, that closes the circle. Let's just assume that we have a finite number of words, which is probably untrue, and consider a text to be a sequence of symbols drawn from that vocabulary.

The usual way you handle text is to apply some filtering and pre-processing. The text arrives basically as a string of characters, and the first thing you do is filter out the parts that are irrelevant to your problem. For example, if you're trying to predict the emotional sentiment of a movie review, you can often ignore punctuation, brackets, and any HTML or XML markup if there is some. Then you have to split the text into tokens, so you use regular expressions or maybe a more complicated tokenizer that takes the string and splits it into a list of entities that belong to a certain dictionary: a list of all characters or of all words, for example. Then you need to somehow extract features from your text in order to make any machine learning model applicable. The simplest and most popular way to do that is the so-called Bag of Words approach. This approach constructs a feature vector that is a vector of counts: for every word in the dictionary, you count how many times it appears in the text.
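Here is a minimal Python sketch of this pipeline, filtering, tokenization, and the Bag of Words counts; the regular expression, the toy review, and the tiny vocabulary are illustrative assumptions, not something from the lecture.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only alphabetic tokens; punctuation and
    # markup characters are simply dropped, as in the filtering step above.
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    # Count how many times each dictionary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical example: a toy review and a toy dictionary.
review = "I liked the movie, I really liked it!"
vocab = ["liked", "disliked", "movie", "boring"]
print(bag_of_words(review, vocab))  # -> [2, 0, 1, 0]
```

In a real pipeline the dictionary contains tens of thousands of words, so the resulting count vector is very long and mostly zeros.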
For example, in this text here, the word "journal" appears three times, while the word "learning" does not appear at all, so it gets a zero. Basically, you go over all the words you have in the dictionary and build this large feature vector. Now, if you have any experience with computational linguistics, you probably know that there are many words: tens of thousands, if not hundreds of thousands. If you include all the possible typos, for example, the list grows even further.

The problem with this approach is that although it constructs a feature vector, it completely ignores word ordering. If you have the word "no", you won't be able to restore the position of this word, and you won't be able to understand where in the sentence the negation was applied; you just know that the word "no" occurs once, and this is a problem. We will, of course, deal with this problem, but before we do that, let's find a way to use this representation to get something useful done, for example in sentiment classification.

This problem of sentiment classification is the problem of trying to predict whether a particular text, for example a movie review or a tweet, is emotionally positive, negative, or neutral. Of course, this is not the only or the most important problem in natural language processing, but it is an important one. For example, if you could predict sentiment efficiently, you would be able to use this model to survey social media: say you have a new product, you can grab all the tweets that mention it and find which age groups, for example, have the most positive opinion of your new product. Then you can get some insight into how to advertise it most efficiently. Now, this problem, of course, is not the hardest one, but there are a lot of methods that work across all of natural language processing, and we now study those methods applied to text classification or regression. One popular way to approach this problem is to use your Bag of Words counts or frequencies as features for any classifier you like; for example, you could try logistic regression.
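A minimal sketch of this Bag of Words plus linear classifier setup, assuming scikit-learn is available; the four toy reviews and their labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (illustrative only): 1 = positive, 0 = negative.
texts = [
    "I liked this movie, really enjoyed it",
    "awesome film, would watch again",
    "I disliked it, what a boring movie",
    "terrible plot, I hated every minute",
]
labels = [1, 1, 0, 0]

# CountVectorizer builds the word-count features; logistic regression
# then learns one weight per word of the resulting dictionary.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["I enjoyed this awesome movie"]))  # likely [1]
print(model.predict(["boring film, I disliked it"]))    # likely [0]
```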
For this particular problem, logistic regression and other linear models are very easy to interpret, and they do a very sensible thing. Each of the Bag of Words features is the count of a particular word, and logistic regression assigns this word a dedicated weight. Large positive weights correspond to emotionally positive words like "awesome", "liked", "enjoyed", if it's a movie review, for example. If a particular word has a negative weight, it means that this word negatively influences the sentiment; for example, the word "disliked" in a movie review. If a word gets a weight near zero, it means that the word is irrelevant: a comma, or "and", or other such words would probably be irrelevant for most sentiment classification problems. Now, to train this model you would, of course, have to use a labeled dataset, where for each text you have obtained, either manually or by some similar means, a reference sentiment label. But the general idea is just like any other supervised machine learning task.

Now, of course, there is one way we could extend a linear model in any situation where it gets applied: we could try a neural network model. Sorry, wrong picture. For example, we could build a two- or three-layer dense neural network. It would take your word frequencies, first compute some kind of intermediate auxiliary features, then mix those features up again and again until you're satisfied, and then estimate the probability of the output class. The only problem is that this kind of neural network doesn't actually solve the main issue with our model, because, you see, in sentiment classification you usually can't just download gigabytes of perfectly labeled data. It's really hard to make people sit and label sentiments, especially if they are not getting paid for it, which is the case for, well, hundreds of students in universities.
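As a sketch of such a two- or three-layer dense network, assuming TensorFlow/Keras and a Bag of Words count matrix of shape (n_samples, vocab_size); the vocabulary size and the layer widths below are arbitrary assumptions, not numbers from the lecture.

```python
import tensorflow as tf

vocab_size = 10_000  # assumed dictionary size

# A small dense network over Bag of Words counts: two hidden layers of
# auxiliary features, then a sigmoid that estimates the probability
# of a positive sentiment.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=5)  # X_train: (n_samples, vocab_size) counts
```

Note how the first hidden layer alone already has vocab_size times 128 weights, which is exactly the "rich model, small dataset" issue discussed next.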
The idea here is that while the dataset is very limited, your model is actually very rich, because, remember, there tend to be tens of thousands of those Bag of Words features, if not more, so you would actually have to learn a very large set of weights here. This approach would work in some cases, but our main goal here is not to bolt on yet another deep neural network, but to try to find a better way to represent words: not just the Bag of Words representation, not just a large one-hot or count vector. We actually need some better representation, not Bag of Words, which is 10,000 features long, but something more compact: for each word, some kind of compact, small vector that captures the relevant information about this word. Now, what counts as relevant information is, of course, up for debate. In most cases, we would appreciate it if synonyms had similar vectors, and if antonyms, or just generally semantically different words, were far enough apart in this representation that the model can differentiate between them. Basically, for sentiment classification, I would like the vectors for "liked" and "enjoyed" to be more or less close to one another, but the vectors for "liked" and "disliked" should be far enough apart for a linear model to actually notice the difference.

We actually tried to solve a very similar problem before, and it was called embeddings, or manifold learning. We have special-purpose methods like multidimensional scaling or t-Distributed Stochastic Neighbor Embedding that solve something that resembles this problem, but they solve it for the purpose of visualization. For example, multidimensional scaling tries to take your original high-dimensional data, for example images, and assign to each item a two-dimensional or otherwise low-dimensional point, so that close vectors, close images, are mapped to close points, and different images, the ones that have a large Euclidean distance in the original space, end up far apart. Now, of course, t-SNE does a slightly different thing, and t-SNE is the actual thing here on the slide.
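To make the "close versus far" intuition concrete, here is a tiny numpy sketch with made-up 3-dimensional vectors for "liked", "enjoyed", and "disliked"; real embeddings are learned from data and have far more dimensions, so these numbers are purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, -1.0 the opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy embeddings, purely for illustration.
vectors = {
    "liked":    np.array([ 0.9, 0.1, 0.3]),
    "enjoyed":  np.array([ 0.8, 0.2, 0.4]),
    "disliked": np.array([-0.9, 0.0, 0.2]),
}

print(cosine_similarity(vectors["liked"], vectors["enjoyed"]))   # high, around 0.98
print(cosine_similarity(vectors["liked"], vectors["disliked"]))  # negative, around -0.85
```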
But regardless of which method from this list we use, the problem is that for word embeddings, say if we want to embed words into a small, compact representation, we won't be able to use those methods; you won't be able to use them without changing how they work. The problem here is that while for images it's more or less okay, not exactly natural but more or less appropriate, to use the pixelwise squared error, the pixelwise Euclidean distance, as the distance, for words this trick won't work. Remember, in the Bag of Words style of representation one word is just a one-hot vector with 10,000 elements, of which only a single element is a one and the rest are zeros. If we compute the Euclidean distance between two such vectors, it will be either the square root of two, I believe, in case the words are different, or zero, if you are comparing the distance between a word and itself. We actually need some better way to define what it means to be similar, what it means for words to have similar representations. To answer this question, let's look into the popular Word2vec model and its family of embedding methods.
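Before moving on, a quick numpy check of the square-root-of-two point above, using an assumed vocabulary of 5 words instead of 10,000: the Euclidean distance between any two distinct one-hot word vectors is always the same, so it carries no information about how related the words are.

```python
import numpy as np

vocab_size = 5  # stand-in for the real ~10,000-word dictionary

def one_hot(index, size=vocab_size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

liked, enjoyed, disliked = one_hot(0), one_hot(1), one_hot(2)

# Every pair of distinct words is exactly sqrt(2) apart,
# and every word is at distance 0 from itself.
print(np.linalg.norm(liked - enjoyed))   # 1.4142... == sqrt(2)
print(np.linalg.norm(liked - disliked))  # 1.4142... == sqrt(2)
print(np.linalg.norm(liked - liked))     # 0.0
```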