We've just covered yet another field of machine learning: deep unsupervised learning, with autoencoders and all that kind of stuff. To complete the collection, let's now cover another application domain; namely, let's talk about how to apply unsupervised learning to text data. A disclaimer, though: we won't have time to cover text in depth. However, you'll be able to get much more information about it in the appropriate course of our specialization.

What is a text? There are a lot of linguistic articles describing text in much more high-level notions, but for us it is sufficient to say that a text is a sequence of words, or a sequence of characters if you don't want to go that deep. A word, a character, or some other kind of token is just an atomic element of text; basically, that closes the circle. Let's just assume that we have a finite number of words, which is probably untrue, and consider a text to be a sequence of symbols drawn from that vocabulary.

The usual way you handle text is to apply some filtering and pre-processing. The text arrives basically as a string of characters, and the first thing you do is filter out the parts that are irrelevant to your problem. For example, if you're trying to predict the emotional sentiment of a movie review, you can often ignore punctuation, brackets, and any HTML or XML markup if there is some. Then you have to split the text into tokens, so you use regular expressions or maybe a more complicated tokenizer that takes the string and splits it into a list of entities that belong to a certain dictionary: a list of all characters or of all words, for example. Then you need to somehow extract features from your text in order to make any machine learning model applicable. The simplest and most popular way to do that is the so-called Bag of Words approach. This approach constructs a feature vector that is a vector of counts: for every word in the dictionary, you count how many times it appears in the text.
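Here is a minimal Python sketch of this pipeline, filtering, tokenization, and the Bag of Words counts; the regular expression, the toy review, and the tiny vocabulary are illustrative assumptions, not something from the lecture.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only alphabetic tokens; punctuation and
    # markup characters are simply dropped, as in the filtering step above.
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    # Count how many times each dictionary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical example: a toy review and a toy dictionary.
review = "I liked the movie, I really liked it!"
vocab = ["liked", "disliked", "movie", "boring"]
print(bag_of_words(review, vocab))  # -> [2, 0, 1, 0]
```

In a real pipeline the dictionary contains tens of thousands of words, so the resulting count vector is very long and mostly zeros.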
For example, in this text here, the word "journal" appears three times, while the word "learning" does not appear at all, so it gets a zero. Basically, you go over all the words you have in the dictionary and build this large feature vector. Now, if you have any experience with computational linguistics, you probably know that there are many words: tens of thousands, if not hundreds of thousands. If you include all the possible typos, for example, the list grows even further.

The problem with this approach is that although it constructs a feature vector, it completely ignores word ordering. If you have the word "no", you won't be able to restore the position of this word, and you won't be able to understand where in the sentence the negation was applied; you just know that the word "no" occurs once, and this is a problem. We will, of course, deal with this problem, but before we do that, let's find a way to use this representation to get something useful done, for example in sentiment classification.

This problem of sentiment classification is the problem of trying to predict whether a particular text, for example a movie review or a tweet, is emotionally positive, negative, or neutral. Of course, this is not the only or the most important problem in natural language processing, but it is an important one. For example, if you could predict sentiment efficiently, you would be able to use this model to survey social media: say you have a new product, you can grab all the tweets that mention it and find which age groups, for example, have the most positive opinion of your new product. Then you can get some insight into how to advertise it most efficiently. Now, this problem, of course, is not the hardest one, but there are a lot of methods that work across all of natural language processing, and we now study those methods applied to text classification or regression. One popular way to approach this problem is to use your Bag of Words counts or frequencies as features for any classifier you like; for example, you could try logistic regression.
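A minimal sketch of this Bag of Words plus linear classifier setup, assuming scikit-learn is available; the four toy reviews and their labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (illustrative only): 1 = positive, 0 = negative.
texts = [
    "I liked this movie, really enjoyed it",
    "awesome film, would watch again",
    "I disliked it, what a boring movie",
    "terrible plot, I hated every minute",
]
labels = [1, 1, 0, 0]

# CountVectorizer builds the word-count features; logistic regression
# then learns one weight per word of the resulting dictionary.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["I enjoyed this awesome movie"]))  # likely [1]
print(model.predict(["boring film, I disliked it"]))    # likely [0]
```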
For this particular problem, logistic regression and other linear models are very easy to interpret, and they do a very sensible thing. Each of the Bag of Words features is the count of a particular word, and logistic regression assigns this word a dedicated weight. Large positive weights correspond to emotionally positive words like "awesome", "liked", "enjoyed", if it's a movie review, for example. If a particular word has a negative weight, it means that this word negatively influences the sentiment; for example, the word "disliked" in a movie review. If a word gets a weight near zero, it means that the word is irrelevant: a comma, or "and", or other such words would probably be irrelevant for most sentiment classification problems. Now, to train this model you would, of course, have to use a labeled dataset, where for each text you have obtained, either manually or by some similar means, a reference sentiment label. But the general idea is just like any other supervised machine learning task.

Now, of course, there is one way we could extend a linear model in any situation where it gets applied: we could try a neural network model. Sorry, wrong picture. For example, we could build a two- or three-layer dense neural network. It would take your word frequencies, first compute some kind of intermediate auxiliary features, then mix those features up again and again until you're satisfied, and then estimate the probability of the output class. The only problem is that this kind of neural network doesn't actually solve the main issue with our model, because, you see, in sentiment classification you usually can't just download gigabytes of perfectly labeled data. It's really hard to make people sit and label sentiments, especially if they are not getting paid for it, which is the case for, well, hundreds of students in universities.
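As a sketch of such a two- or three-layer dense network, assuming TensorFlow/Keras and a Bag of Words count matrix of shape (n_samples, vocab_size); the vocabulary size and the layer widths below are arbitrary assumptions, not numbers from the lecture.

```python
import tensorflow as tf

vocab_size = 10_000  # assumed dictionary size

# A small dense network over Bag of Words counts: two hidden layers of
# auxiliary features, then a sigmoid that estimates the probability
# of a positive sentiment.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=5)  # X_train: (n_samples, vocab_size) counts
```

Note how the first hidden layer alone already has vocab_size times 128 weights, which is exactly the "rich model, small dataset" issue discussed next.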
The idea here is that while the dataset is very limited, your model is actually very rich, because, remember, there tend to be tens of thousands of those Bag of Words features, if not more, so you would actually have to learn a very large set of weights here. This approach would work in some cases, but our main goal here is not to bolt on yet another deep neural network, but to try to find a better way to represent words: not just the Bag of Words representation, not just a large one-hot or count vector. We actually need some better representation, not Bag of Words, which is 10,000 features long, but something more compact: for each word, some kind of compact, small vector that captures the relevant information about this word. Now, what counts as relevant information is, of course, up for debate. In most cases, we would appreciate it if synonyms had similar vectors, and if antonyms, or just generally semantically different words, were far enough apart in this representation that the model can differentiate between them. Basically, for sentiment classification, I would like the vectors for "liked" and "enjoyed" to be more or less close to one another, but the vectors for "liked" and "disliked" should be far enough apart for a linear model to actually notice the difference.

We actually tried to solve a very similar problem before, and it was called embeddings, or manifold learning. We have special-purpose methods like multidimensional scaling or t-Distributed Stochastic Neighbor Embedding that solve something that resembles this problem, but they solve it for the purpose of visualization. For example, multidimensional scaling tries to take your original high-dimensional data, for example images, and assign to each item a two-dimensional or otherwise low-dimensional point, so that close vectors, close images, are mapped to close points, and different images, the ones that have a large Euclidean distance in the original space, end up far apart. Now, of course, t-SNE does a slightly different thing, and t-SNE is the actual thing here on the slide.
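To make the "close versus far" intuition concrete, here is a tiny numpy sketch with made-up 3-dimensional vectors for "liked", "enjoyed", and "disliked"; real embeddings are learned from data and have far more dimensions, so these numbers are purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, -1.0 the opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy embeddings, purely for illustration.
vectors = {
    "liked":    np.array([ 0.9, 0.1, 0.3]),
    "enjoyed":  np.array([ 0.8, 0.2, 0.4]),
    "disliked": np.array([-0.9, 0.0, 0.2]),
}

print(cosine_similarity(vectors["liked"], vectors["enjoyed"]))   # high, around 0.98
print(cosine_similarity(vectors["liked"], vectors["disliked"]))  # negative, around -0.85
```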
But regardless of which method from this list we use, the problem is that for word embeddings, say if we want to embed words into a small, compact representation, we won't be able to use those methods; you won't be able to use them without changing how they work. The problem here is that while for images it's more or less okay, not exactly natural but more or less appropriate, to use the pixelwise squared error, the pixelwise Euclidean distance, as the distance, for words this trick won't work. Remember, in the Bag of Words style of representation one word is just a one-hot vector with 10,000 elements, of which only a single element is a one and the rest are zeros. If we compute the Euclidean distance between two such vectors, it will be either the square root of two, I believe, in case the words are different, or zero, if you are comparing the distance between a word and itself. We actually need some better way to define what it means to be similar, what it means for words to have similar representations. To answer this question, let's look into the popular Word2vec model and its family of embedding methods.
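Before moving on, a quick numpy check of the square-root-of-two point above, using an assumed vocabulary of 5 words instead of 10,000: the Euclidean distance between any two distinct one-hot word vectors is always the same, so it carries no information about how related the words are.

```python
import numpy as np

vocab_size = 5  # stand-in for the real ~10,000-word dictionary

def one_hot(index, size=vocab_size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

liked, enjoyed, disliked = one_hot(0), one_hot(1), one_hot(2)

# Every pair of distinct words is exactly sqrt(2) apart,
# and every word is at distance 0 from itself.
print(np.linalg.norm(liked - enjoyed))   # 1.4142... == sqrt(2)
print(np.linalg.norm(liked - disliked))  # 1.4142... == sqrt(2)
print(np.linalg.norm(liked - liked))     # 0.0
```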