Hi. Often in competitions, we have data like text and images. If we have only them, we can apply an approach specific to this type of data. For example, we can use search engines in order to find similar text; that was the case in the Allen AI Challenge, for example. For images, on the other hand, we can use convolutional neural networks, like in the Data Science Bowl and a whole bunch of other competitions.

But if we have text or images as additional data, we usually must extract different features from them, which can be added as complementary to our main dataframe of samples and features. A very simple example of such a case can be found in the Titanic dataset: it has the column Name, which is more or less like text, and to use it, we first need to derive useful features from it. As another example, we may need to predict whether a pair of online advertisements are duplicates, like slightly different copies of each other, and we could have images from these advertisements as complementary data, as in the Avito Duplicate Ads Detection competition. Or you may be given the task of classifying documents, like in the Tradeshift Text Classification Challenge.

When feature extraction is done, we can treat the extracted features differently. Sometimes we just want to add new features to the existing dataframe. Sometimes we might even want to use the derived features independently and, in the end, do stacking with the base solution. We will go through stacking and learn how to apply it later, in the topic about ensembles, but for now you should know that both ways first require us to somehow extract features from text and images. And this is exactly what we will discuss in this video.

Let's start with feature extraction from text. There are two main ways to do this: the first is to apply bag of words, and the second is to use embeddings like Word2vec. Now we'll talk a bit about each of these methods, and in addition, we will go through the text preprocessing related to them.

Let's start with the first approach, the simplest one: bag of words. Here we create a new column for each unique word from the data, then we simply count the number of occurrences of each word and place this value in the appropriate column.
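As a minimal sketch of this counting step (the example texts are made up, and get_feature_names_out assumes scikit-learn 1.0 or newer):

```python
# Bag of words sketch: one column per unique word, values are occurrence counts.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["very very sunny day", "a sunny day with a cat"]  # made-up example texts

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)        # sparse matrix: rows = texts, columns = unique words

print(vectorizer.get_feature_names_out())       # the vocabulary, one entry per column
print(counts.toarray())                         # raw occurrence counts for each text
```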
After applying this operation to each row, we will have the usual dataframe of samples and features. In sklearn, this can be done with CountVectorizer. We can also post-process the calculated matrix using some predefined methods.

To understand why we need post-processing, let's remember that some models, like kNN, linear regression, and neural networks, depend on the scaling of features. So the main goal of post-processing here is, on one side, to make samples more comparable, and on the other, to boost more important features while decreasing the scale of useless ones.

One way to achieve the first goal, making samples more comparable, is to normalize the sum of values in a row. In this way, we will count not occurrences but frequencies of words. Thus, texts of different sizes will be more comparable. This is the exact purpose of the term frequency transformation.

To achieve the second goal, that is, to boost more important features, we post-process our matrix by normalizing the data column-wise. A good idea is to normalize each feature by the inverse fraction of documents which contain the exact word corresponding to this feature. In this case, features corresponding to frequent words will be scaled down compared to features corresponding to rarer words. We can further improve this idea by taking a logarithm of these normalization coefficients. As a result, this will decrease the significance of widespread words in the dataset and thus improve the feature scaling. This is the purpose of the inverse document frequency transformation.

Term frequency and inverse document frequency transformations are often used together, as in sklearn's TfidfVectorizer. Let's apply the TFiDF transformation to the previous example. First, TF. Nice: occurrences are now switched to frequencies, which means the sum of values in each row is now equal to one. Now, IDF. Great: the data is now normalized column-wise, and, as you can see, the IDF transformation scaled down the appropriate feature.

It's worth mentioning that there are plenty of other variants of TFiDF which may work better depending on the specific data. Another very useful technique is Ngrams.
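A hedged sketch of this TF and iDF post-processing with sklearn's TfidfVectorizer, on made-up texts; norm="l1" is passed explicitly so that row values sum to one as in the TF example above, since the library's default is L2 normalization:

```python
# TFiDF sketch: term frequencies per row, scaled down column-wise by document frequency.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["did you have a dog", "you have a cat"]   # made-up example texts

tfidf = TfidfVectorizer(norm="l1")                 # l1 => values in each row sum to one
features = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())
print(features.toarray())                          # words shared by both texts get smaller weights
```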
The concept of Ngrams is simple: you add not only columns corresponding to single words, but also columns corresponding to sequences of consecutive words. This concept can also be applied to sequences of chars, and in cases with low N, we'll have a column for each possible combination of N chars. As we can see, for N = 1, the number of these columns will be equal to 28. Let's calculate the number of these columns for N = 2. Well, it will be 28 squared. Note that sometimes it can be cheaper to have every possible char Ngram as a feature, instead of having a feature for each unique word from the dataset. Using char Ngrams also helps our model to handle unseen words, for example, rare forms of already used words. In sklearn, CountVectorizer has an appropriate parameter for using Ngrams; it is called ngram_range. To change from word Ngrams to char Ngrams, you may use the parameter named analyzer.

Usually, you may want to preprocess text even before applying bag of words, and sometimes careful text preprocessing can help bag of words drastically. Here, we will discuss such methods as converting text to lowercase, lemmatization, stemming, and the usage of stopwords.

Let's consider a simple example which shows the utility of lowercase. What if we applied bag of words to the sentence "Very, very sunny"? We would get three columns, one for each distinct string, because "Very" with a capital letter is not the same string as "very" without it. So we get multiple columns for the same word, and in the same way, "Sunny" with a capital letter wouldn't match "sunny" without it. So the first preprocessing step we want to do is to apply lowercase to our text. Fortunately, CountVectorizer from sklearn does this by default.

Now, let's move on to lemmatization and stemming. These methods refer to more advanced preprocessing. Let's look at this example: we have two sentences, "I had a car" and "We have cars". We may want to unify the words car and cars, which are basically the same word. The same goes for had and have, and so on. Both stemming and lemmatization may be used to fulfill this purpose, but they achieve it in different ways. Stemming usually refers to a heuristic process that chops off the endings of words and thus unites the derived forms of related words like democracy, democratic, and democratization, producing something like "democr" for each of these words.
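A small sketch of the ngram_range and analyzer parameters mentioned above, again on made-up sentences:

```python
# Word Ngrams vs char Ngrams with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I had a car", "We have cars"]            # made-up example texts

# columns for single words and for pairs of consecutive words
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2))
print(word_ngrams.fit_transform(texts).shape)

# columns for pairs of consecutive chars; lowercasing is applied by default in both cases
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
print(char_ngrams.fit_transform(texts).shape)
```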
Lemmatization, on the other hand, usually means that you want to do this more carefully, using knowledge of vocabulary and morphological analysis of words, returning "democracy" for each of these words.

Let's look at another example that shows the difference between stemming and lemmatization by applying them to the word "saw". While stemming will return only the letter "s", lemmatization will try to return either "see" or "saw", depending on the word's meaning.

The last technique for text preprocessing which we will discuss here is the usage of stopwords. Basically, stopwords are words which do not contain important information for our model. They are either insignificant, like articles or prepositions, or so common that they do not help to solve our task. Most languages have a predefined list of stopwords, which can be found on the Internet or loaded from NLTK, which stands for Natural Language Toolkit, a library for Python. CountVectorizer from sklearn also has a parameter related to stopwords, which is called max_df. max_df is a threshold on how frequently a word may occur in the corpus; words seen more often than this threshold will be removed from the text corpus.

Good, we have just discussed the classical feature extraction pipeline for text. At the beginning, we may want to preprocess our text. To do so, we can apply lowercase, stemming, lemmatization, or remove stopwords. After preprocessing, we can use the bag of words approach to get a matrix where each row represents a text and each column represents a unique word. We can also use the bag of words approach for Ngrams and add new columns for groups of several consecutive words or chars. And in the end, we can post-process this matrix using TFiDF, which often proves to be useful.

Well, now we can add the extracted features to our basic dataframe, or fit an independent model on them to create some tricky features.

That's all for now. In the next video, we will continue to discuss feature extraction. We'll go through two big points: first, we'll talk about the Word2vec approach for texts, and second, we will discuss feature extraction for images.
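To tie the pipeline together, here is one possible end-to-end sketch on made-up texts: stemming with NLTK's PorterStemmer, then bag of words with stopword removal and TFiDF; stop_words="english" and the 0.9 cutoff for max_df are illustrative choices, not values from the lecture:

```python
# Preprocess (lowercase + stem), then bag of words with stopword removal and TFiDF.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I had a car", "We have cars", "We have a sunny day"]   # made-up example texts

stemmer = PorterStemmer()
stemmed = [" ".join(stemmer.stem(word) for word in text.lower().split()) for text in texts]

# stop_words="english" drops a predefined stopword list;
# max_df=0.9 additionally removes words appearing in more than 90% of the documents
tfidf = TfidfVectorizer(stop_words="english", max_df=0.9)
features = tfidf.fit_transform(stemmed)

print(tfidf.get_feature_names_out())   # remaining vocabulary after stemming and stopword removal
print(features.shape)                  # rows = texts, columns = surviving (stemmed) words
```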