[MUSIC] Hi. Often in competitions we have additional data like text and images. If we have only this data, we can apply approaches specific to that data type. For example, we can use search engines to find similar texts; that was the case in the Allen AI Challenge, for example. For images, on the other hand, we can use convolutional neural networks, like in the Data Science Bowl and a whole bunch of other competitions.

But if we have text or images as additional data, we usually must extract different features from them, which can be added as complementary columns to our main dataframe of samples and features. A very simple example of such a case is the Titanic dataset: it has the column Name, which is more or less like text, and to use it, we first need to derive useful features from it. Another example: we may need to predict whether a pair of online advertisements are duplicates, that is, slightly different copies of each other, and we could have images from these advertisements as complementary data, like in the Avito Duplicate Ads Detection competition. Or you may be given the task of classifying documents, like in the Tradeshift Text Classification Challenge.

When feature extraction is done, we can treat the extracted features differently. Sometimes we just want to add the new features to the existing dataframe. Sometimes we might even want to use the extracted features independently and, in the end, combine them with the base solution via stacking. We will go through stacking and learn how to apply it later, in the topic about ensembles, but for now you should know that both ways first require us to somehow extract features from text and images. And this is exactly what we will discuss in this video.

Let's start with feature extraction from text. There are two main ways to do this: the first is to apply bag of words, and the second is to use embeddings like word2vec. We'll talk a bit about each of these methods, and in addition we will go through the text preprocessing related to them.

Let's start with the first approach, the simplest one: bag of words. Here we create a new column for each unique word in the data, then simply count the number of occurrences of each word and place this value in the appropriate column. After applying this procedure to each row, we get the usual dataframe of samples and features. In sklearn, this can be done with CountVectorizer.
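As a rough illustration of this counting step, here is a minimal sketch with sklearn's CountVectorizer; the toy texts simply reuse example sentences that appear later in this video:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely for illustration
texts = ["Very, very sunny", "I had a car", "We have cars"]

# Each unique word becomes a column; each cell counts occurrences of that word in a row
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names() on older sklearn)
print(counts.toarray())                    # the resulting samples-by-features matrix
```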
We can also postprocess the calculated matrix using some predefined methods. To understand why we need postprocessing, let's remember that some models, like kNN, linear regression, and neural networks, depend on the scaling of features. So the main goal of postprocessing here is, on the one hand, to make samples more comparable, and on the other, to boost more important features while decreasing the scale of useless ones.

One way to achieve the first goal, making samples more comparable, is to normalize the sum of values in a row. In this way we count not occurrences but frequencies of words, so texts of different sizes become more comparable. This is the exact purpose of the term frequency transformation. To achieve the second goal, that is, to boost more important features, we can postprocess our matrix by normalizing the data column-wise. A good idea is to normalize each feature by the inverse fraction of documents which contain the exact word corresponding to this feature. In this case, features corresponding to frequent words will be scaled down compared to features corresponding to rarer words. We can further improve this idea by taking a logarithm of these normalization coefficients. As a result, this will decrease the significance of the most widespread words in the dataset. This is the purpose of the inverse document frequency transformation.

Term frequency and inverse document frequency transformations are often used together, like in sklearn's TfidfVectorizer. Let's apply the TF-IDF transformation to the previous example. First, TF. Nice: occurrences were switched to frequencies, which means the sum of values in each row is now equal to one. Now, IDF. Great: the data is now normalized column-wise, and, as you can see, the IDF transformation scaled down the features corresponding to the most frequent words. It's worth mentioning that there are plenty of other variants of TF-IDF which may work better, depending on the specific data.

Another very useful technique is Ngrams. The concept of an Ngram is simple: you add not only a column for each word, but also columns for each group of N consecutive words. This concept can also be applied to sequences of chars, and in cases with low N, we'll have a column for each possible combination of N chars. As we can see, for N = 1 the number of these columns will be equal to 28. Let's calculate the number of these columns for N = 2: well, it will be 28 squared. Note that sometimes it can be cheaper to have every possible char Ngram as a feature, instead of having a feature for each unique word from the dataset. Using char Ngrams also helps our model handle unseen words, for example rare forms of already used words. In sklearn, CountVectorizer has an appropriate parameter for using Ngrams; it is called ngram_range. To change from word Ngrams to char Ngrams, you may use the parameter named analyzer.

Usually you may want to preprocess text even before applying bag of words, and sometimes careful text preprocessing can help bag of words drastically. Here we will discuss such methods as converting text to lowercase, lemmatization, stemming, and the usage of stopwords.

Let's consider a simple example which shows the utility of lowercase. What if we apply bag of words to the sentence "Very, very sunny"? We will get a separate column for each distinct string. Because "Very" with a capital letter is not the same string as "very" without it, we will get multiple columns for the same word, and likewise "Sunny" with a capital letter doesn't match "sunny" without it. So the first preprocessing step we want to do is to apply lowercase to our text. Fortunately, CountVectorizer from sklearn does this by default.

Now let's move on to lemmatization and stemming. These methods refer to more advanced preprocessing. Let's look at this example: we have two sentences, "I had a car" and "We have cars". We may want to unify the words car and cars, which are basically the same word; the same goes for had and have, and so on. Both stemming and lemmatization may be used to fulfill this purpose, but they achieve it in different ways. Stemming usually refers to a heuristic process that chops off the endings of words and thus unites the forms of related words like democracy, democratic, and democratization, producing something like democr for each of these words. Lemmatization, on the other hand, usually means that you want to do this carefully, using knowledge of vocabulary and morphological analysis of words, returning democracy for each of these words. Let's look at another example that shows the difference between stemming and lemmatization by applying them to the word saw: while stemming may return only the letter s, lemmatization will try to return either see or saw, depending on the word's meaning.

The last technique for text preprocessing which we will discuss here is the usage of stopwords. Basically, stopwords are words which do not contain important information for our model. They are either insignificant, like articles or prepositions, or so common that they do not help to solve our task. Most languages have predefined lists of stopwords, which can be found on the Internet or loaded from NLTK, which stands for Natural Language Toolkit, a library for Python. CountVectorizer from sklearn also has a parameter related to stopwords, which is called max_df. max_df is a threshold on the fraction of documents a word may appear in; once a word exceeds this threshold, it is removed from the text corpus.
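To make the TF-IDF part concrete, here is a minimal sketch with sklearn's TfidfVectorizer. The toy texts are just for illustration, and sklearn's exact formula differs slightly from the version described above (it uses a smoothed IDF and normalizes rows after applying IDF), but the idea is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely for illustration
texts = ["Very, very sunny", "I had a car", "We have cars"]

# TF turns counts into frequencies; IDF scales down words that appear in many documents
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())  # vocabulary
print(matrix.toarray())               # reweighted samples-by-features matrix
```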
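Here is a small sketch, again with made-up example texts, of switching between word Ngrams and char Ngrams via CountVectorizer's ngram_range and analyzer parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I had a car", "We have cars"]

# Word Ngrams: columns for single words and for pairs of consecutive words
word_ngrams = CountVectorizer(ngram_range=(1, 2))
print(word_ngrams.fit_transform(texts).shape)
print(word_ngrams.get_feature_names_out())  # e.g. 'had', 'had car', ...

# Char Ngrams: columns for short sequences of characters,
# which helps the model handle unseen or rare word forms
char_ngrams = CountVectorizer(analyzer='char', ngram_range=(1, 2))
print(char_ngrams.fit_transform(texts).shape)
```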
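To make the preprocessing part concrete, here is a rough sketch using NLTK's Porter stemmer, WordNet lemmatizer, and stopword list, together with CountVectorizer's max_df. The choice of stemmer and lemmatizer, the example words, and the 0.9 threshold are illustrative assumptions, not prescriptions from the lecture:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# One-time downloads for the lemmatizer and the stopword lists
nltk.download('wordnet')
nltk.download('stopwords')

stemmer = PorterStemmer()         # heuristic: chops off word endings
lemmatizer = WordNetLemmatizer()  # vocabulary-based: maps words to their dictionary form

for word in ['democracy', 'democratic', 'democratization', 'cars', 'saw']:
    # Exact outputs depend on the particular stemmer/lemmatizer used
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))

print(lemmatizer.lemmatize('saw', pos='v'))  # 'see' when 'saw' is treated as a verb

# A predefined English stopword list, plus max_df as another way to drop
# overly common words (a word appearing in more than 90% of documents is removed)
english_stopwords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words=english_stopwords, max_df=0.9)
```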
Good, we have just discussed the classical feature extraction pipeline for text. At the beginning, we may want to preprocess our text: to do so, we can apply lowercase, stemming, lemmatization, or remove stopwords. After preprocessing, we can use the bag of words approach to get a matrix where each row represents a text and each column represents a unique word. We can also use the bag of words approach for Ngrams, adding new columns for groups of several consecutive words or chars. And in the end, we can postprocess these matrices using TF-IDF, which often proves to be useful. Well done: now we can add the extracted features to our basic dataframe, or fit an independent model on them to create some tricky features. That's all for now. In the next video, we will continue to discuss feature extraction. We'll go through two big topics: first, we'll talk about the word2vec approach for texts, and second, we will discuss feature extraction from images. [MUSIC]