We've just covered yet another field of machine learning: deep unsupervised learning with autoencoders and related methods. To complete the picture, let's now cover another application domain. Namely, let's talk about how we apply unsupervised learning to text data. A disclaimer though: we won't have time to cover text in depth. However, you'll be able to get much more information about it in the appropriate course in our specialization.

What is a text? There are plenty of linguistic articles describing text in much higher-level notions, but for us it's sufficient to say that a text is a sequence of words, or a sequence of characters if you don't want to go that deep. A word, a character, or some other kind of token is just an atomic element of text. Let's assume that we have a finite dictionary of words, which is probably not entirely true, and consider a text a symbolic sequence of those words.

The usual way you handle text is to treat it as a string of characters and apply some filtering and pre-processing. The first thing you do is filter out the parts that are irrelevant to your problem. For example, if you're trying to predict the emotional sentiment of a movie review, you can sometimes ignore punctuation, brackets, and HTML or XML markup if there is any. Then you have to split the text into tokens, so you use regular expressions or some more complicated tokenizer that takes the string and splits it into a list of entities that belong to a certain dictionary, a list of all characters or all words, for example.

Then you actually need to somehow extract features from your text in order to make machine learning models applicable. The simplest and most popular way to do that is the so-called Bag of Words approach. This approach constructs a feature vector of counts: for every word in the dictionary, you count how many times it appears in the text. For example, in this text here, the word "journal" appears three times, while the word "learning" does not appear, so it gets zero. Basically, you go over all the words in your dictionary and fill in this large feature vector. Now, if you have any experience with computational linguistics, you probably know that there are many words, tens of thousands if not hundreds of thousands, and if you include all possible typos, the list grows even further.

The problem with this approach is that although it constructs a feature vector, it completely ignores word ordering. If you have the word "no", you won't be able to restore its position, so you won't be able to tell where in the sentence the negation was applied; you just know that the word "no" occurs once, and this is a problem. We will, of course, deal with this problem, but before we do that, let's see how we can use this representation to get somewhere with sentiment classification, for example. Sentiment classification is the problem of trying to predict whether a particular text, for example a movie review or a tweet, is emotionally positive, negative, or neutral. Of course, this is not the only or the most important problem in natural language processing, but it is an important one.
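To make the tokenize-and-count pipeline concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the exact pipeline from the lecture: the regex tokenizer and the helper names `build_vocabulary` and `bag_of_words` are assumptions for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the string and keep only letter/apostrophe runs; punctuation is dropped.
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts):
    # Map every word seen in the corpus to a column index of the feature vector.
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def bag_of_words(text, vocab):
    # Count how many times each vocabulary word appears in this text.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in sorted(vocab, key=vocab.get)]

texts = ["I really liked this movie", "I disliked the movie, really boring"]
vocab = build_vocabulary(texts)
features = [bag_of_words(t, vocab) for t in texts]
print(vocab)
print(features)
```

Note that the feature vectors are as long as the whole vocabulary, and word order is already lost at this point, which is exactly the limitation discussed above.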
For example, if you could predict sentiment efficiently, you could use this model to survey social media. Say you have a new product: you can grab all the tweets that mention it and find which age groups, for example, have the most positive opinion of it, and from that get some insight on how to advertise it most efficiently. Now, this problem is of course not the hardest one, but many of the methods that work here carry over across natural language processing, so we'll now study those methods applied to text classification or regression.

One popular approach to this problem is to use your Bag of Words counts, the word frequencies, as features for any classifier you like. For example, you could try logistic regression. For this particular problem, logistic regression and other linear models are very easy to interpret, and they do a very sensible thing. Each Bag of Words feature is the count of a particular word, and logistic regression assigns this word a dedicated weight. Large positive weights correspond to emotionally positive words like "awesome", "liked", or "enjoyed", if it's a movie review, for example. A negative weight for some word means that this word pushes the sentiment towards negative, for example the word "disliked" in a movie review. A weight near zero means the word is irrelevant: commas, the word "and", and similar tokens would probably be irrelevant for most sentiment classification problems. Now, to train this model you would of course need a labeled dataset, where for each text you have, either manually or by some similar means, a reference label of its sentiment. But the general idea is just like in any other supervised learning task.

Now, there is of course one way we could extend a linear model in pretty much any situation where it applies: we could try a neural network model. Sorry, wrong picture. For example, we could build a two- or three-layer dense neural network. It takes your word frequencies, first computes some intermediate auxiliary features, then mixes those features up again and again until you're satisfied, and then estimates the probability of the output. The only problem is that this kind of neural network architecture doesn't actually solve the main issue of your model, because in sentiment classification you usually can't just download gigabytes of perfectly labeled data. It's really hard to make people sit and label sentiments, especially if they are not getting paid for it, which is the case for, well, hundreds of students in universities. So while the dataset is very limited, your model is actually very rich, because remember, those Bag of Words features number in the tens of thousands if not more, so you would have to learn a weight for a very large set of words. This approach might work in some cases, but our main goal here is not to bolt on yet another deep neural network, but to find a better way to represent words: not the Bag of Words representation, not a huge one-hot or sparse count vector that is 10,000 features long, but something more compact.
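Here is a minimal sketch of the linear-model idea, assuming a reasonably recent scikit-learn is available; the tiny hand-written review list is purely illustrative, and in practice you would train on a real labeled corpus of reviews or tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: 1 = positive review, 0 = negative review.
reviews = [
    "I really liked this movie, awesome acting",
    "enjoyed every minute, great film",
    "disliked the plot, boring and slow",
    "terrible movie, I hated it",
]
labels = [1, 1, 0, 0]

# Bag of Words features: each column is the count of one dictionary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Logistic regression learns one weight per dictionary word.
clf = LogisticRegression()
clf.fit(X, labels)

# Inspect the learned weights: positive weights suggest positive-sentiment words,
# negative weights suggest negative-sentiment words, near-zero weights are irrelevant.
for word, weight in sorted(zip(vectorizer.get_feature_names_out(), clf.coef_[0]),
                           key=lambda pair: pair[1]):
    print(f"{word:>10s}  {weight:+.3f}")

print(clf.predict(vectorizer.transform(["I liked it"])))
```

With a realistic vocabulary, the number of weights to learn is in the tens of thousands, which is exactly why a small labeled dataset becomes the bottleneck described above.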
We want to represent each word with some kind of compact, small vector that captures the relevant information about this word. Now, what counts as relevant information is, of course, up for debate. In most cases, we would appreciate it if synonyms had similar vectors, while antonyms, or just generally different words, were far enough apart in this representation that a model could differentiate between them. Basically, for sentiment classification, I would like the vectors for "liked" and "enjoyed" to be more or less close to one another, but the vectors for "liked" and "disliked" should be far enough apart for a linear model to actually notice the difference.

We have actually tried to solve a very similar problem before, and it was called embeddings, or manifold learning. We had special-purpose methods like multidimensional scaling or t-Distributed Stochastic Neighbor Embedding (t-SNE) that solved something resembling this problem, but they solved it for the purpose of visualization. For example, multidimensional scaling takes your original high-dimensional data, for example images, and assigns to each point a two-dimensional or otherwise low-dimensional point, so that close vectors, close images, are mapped to close points, while different images, the ones that have a large Euclidean distance in the original space, end up far apart. Now of course, t-SNE does a somewhat different thing, and it is t-SNE that is actually shown here on the slide.

But regardless of which method we pick from this list, the problem is that if we want to embed words into a small compact representation, we won't be able to use those methods without changing how they work. While for images it is more or less okay, not exactly natural but more or less appropriate, to use the pixelwise squared error, the pixelwise Euclidean distance, for words this trick won't work. Remember, in the Bag of Words representation a single word is just a one-hot vector with, say, 10,000 elements, only one of which is a one and the rest are zeros. If we compute the Euclidean distance between two such vectors, it will be either the square root of two, whenever the words are different, or zero, when you're comparing a word with itself. So we need some better way to define what it means for words to be similar, what it means that two words should have similar representations. To answer this question, let's look into the popular Word2vec model and its family of embedding methods.
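As a quick sanity check of that claim, here is a small sketch, assuming NumPy, with a toy vocabulary chosen just for illustration. It builds one-hot vectors and shows that every pair of distinct words is exactly the square root of two apart, so Euclidean distance on one-hot vectors carries no information about meaning.

```python
import numpy as np

# Toy vocabulary; a real dictionary would have tens of thousands of entries.
vocab = ["liked", "enjoyed", "disliked", "movie"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single one at the word's dictionary position.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

for a in vocab:
    for b in vocab:
        dist = np.linalg.norm(one_hot(a) - one_hot(b))
        print(f"d({a!r}, {b!r}) = {dist:.3f}")

# Every distinct pair prints sqrt(2) ~ 1.414; identical words print 0.000.
# Synonyms like 'liked' and 'enjoyed' are no closer than unrelated words,
# which is why we need learned, compact word embeddings instead.
```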