1 00:00:03,240 --> 00:00:06,335 Hi and welcome back. 2 00:00:06,335 --> 00:00:09,580 In this video, we'll talk about the Word2vec approach for 3 00:00:09,580 --> 00:00:14,355 texts and then we'll discuss feature extraction for images. 4 00:00:14,355 --> 00:00:16,615 After we've summarized the pipeline for 5 00:00:16,615 --> 00:00:20,235 feature extraction with the Bag of Words approach in the previous video, 6 00:00:20,235 --> 00:00:23,250 let's overview another approach, 7 00:00:23,250 --> 00:00:24,843 which is widely known as Word2vec. 8 00:00:24,843 --> 00:00:28,460 Just as in the Bag of Words approach, 9 00:00:28,460 --> 00:00:32,600 we want to get vector representations of words and texts, 10 00:00:32,600 --> 00:00:34,980 but now more concise than before. 11 00:00:34,980 --> 00:00:37,925 Word2vec does exactly that. 12 00:00:37,925 --> 00:00:42,585 It converts each word to some vector in some sophisticated space, 13 00:00:42,585 --> 00:00:45,665 which usually has several hundred dimensions. 14 00:00:45,665 --> 00:00:47,679 To learn the word embeddings, 15 00:00:47,679 --> 00:00:50,985 Word2vec uses nearby words. 16 00:00:50,985 --> 00:00:55,950 Basically, different words which are often used in the same context 17 00:00:55,950 --> 00:00:58,915 will be very close in this vector representation, 18 00:00:58,915 --> 00:01:02,020 which, of course, will benefit our models. 19 00:01:02,020 --> 00:01:05,070 Furthermore, there are some prominent examples 20 00:01:05,070 --> 00:01:07,860 showing that we can apply basic operations like 21 00:01:07,860 --> 00:01:10,515 addition and subtraction to these vectors 22 00:01:10,515 --> 00:01:13,610 and expect the results of such operations to be interpretable. 23 00:01:13,610 --> 00:01:18,330 You have probably already seen this example somewhere. 24 00:01:18,330 --> 00:01:24,330 Basically, if we calculate the difference between the vectors of the words queen and king, 25 00:01:24,330 --> 00:01:28,765 and the difference between the vectors of the words woman and man, 26 00:01:28,765 --> 00:01:33,525 we will find that these differences are very similar to each other. 27 00:01:33,525 --> 00:01:36,930 And, if we look at this from another perspective, 28 00:01:36,930 --> 00:01:42,450 and subtract the vector of man from the vector of king and then add the vector of woman, 29 00:01:42,450 --> 00:01:46,140 we will get pretty much the vector of the word queen. 30 00:01:46,140 --> 00:01:48,135 Think about it for a moment. 31 00:01:48,135 --> 00:01:51,715 This is a fascinating fact, and indeed the creation of 32 00:01:51,715 --> 00:01:57,470 the Word2vec approach led to many extensive and far-reaching results in the field. 33 00:01:57,470 --> 00:01:59,370 There are several implementations of 34 00:01:59,370 --> 00:02:03,060 this embedding approach besides Word2vec, namely GloVe, 35 00:02:03,060 --> 00:02:06,225 which stands for Global Vectors for Word Representation, 36 00:02:06,225 --> 00:02:09,100 FastText, and a few others. 37 00:02:09,100 --> 00:02:16,270 Complications may occur if we need to derive vectors not for words but for sentences. 38 00:02:16,270 --> 00:02:18,800 Here, we may take different approaches. 39 00:02:18,800 --> 00:02:22,610 For example, we can calculate the mean or sum of 40 00:02:22,610 --> 00:02:28,318 word vectors, or we can choose another way and go with special models like Doc2vec. 41 00:02:28,318 --> 00:02:33,320 The choice of the way to proceed here depends on the particular situation.
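To make the word-vector and sentence-vector ideas above concrete, here is a minimal sketch using the gensim library (assuming gensim >= 4.0); the toy corpus, the parameter values, and the sentence_vector helper are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: each document is a list of pre-processed tokens.
corpus = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "park"],
    ["woman", "walks", "in", "the", "park"],
]

# vector_size is usually several hundred on real data; window controls
# how many nearby words form the context of each word.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

def sentence_vector(tokens, model):
    """Average the vectors of in-vocabulary words: one simple way
    to get a vector for a whole sentence."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# The famous analogy: king - man + woman should land near queen.
# (This only shows up on corpora far larger than this toy one.)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
print(sentence_vector(["queen", "rules"], model)[:5])
```

For sentence vectors, gensim also ships a Doc2Vec model that can be swapped in here; as the lecture says, it is usually worth trying both and keeping the better one.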
42 00:02:33,320 --> 00:02:37,490 Usually, it is better to check both approaches and select the best. 43 00:02:37,490 --> 00:02:42,265 Training of Word2vec can take quite a long time, 44 00:02:42,265 --> 00:02:45,670 and if you work with texts of some common origin, 45 00:02:45,670 --> 00:02:49,155 you may find pre-trained models on the internet useful. 46 00:02:49,155 --> 00:02:52,620 For example, ones which are trained on Wikipedia. 47 00:02:52,620 --> 00:02:55,570 Otherwise, remember, 48 00:02:55,570 --> 00:02:59,737 the training of Word2vec doesn't require target values for your texts. 49 00:02:59,737 --> 00:03:05,045 It only requires the text itself to extract the context of each word. 50 00:03:05,045 --> 00:03:08,675 Note that all the pre-processing we discussed earlier, 51 00:03:08,675 --> 00:03:12,145 namely lowercasing, stemming, lemmatization, 52 00:03:12,145 --> 00:03:18,980 and the removal of stopwords, can be applied to the text before training Word2vec models. 53 00:03:18,980 --> 00:03:22,310 Now, we're ready to summarize the differences between the Bag of 54 00:03:22,310 --> 00:03:26,710 Words and the Word2vec approaches in the context of competitions. 55 00:03:26,710 --> 00:03:31,950 With Bag of Words, vectors are quite large, but there is a nice benefit: 56 00:03:31,950 --> 00:03:34,805 the meaning of each value in the vector is known. 57 00:03:34,805 --> 00:03:38,756 With Word2vec, vectors have a relatively small length, 58 00:03:38,756 --> 00:03:43,220 but the values in a vector can be interpreted only in some cases, 59 00:03:43,220 --> 00:03:46,890 which sometimes can be seen as a downside. 60 00:03:46,890 --> 00:03:51,560 The other advantage of Word2vec, which is crucial in competitions, 61 00:03:51,560 --> 00:03:57,395 is that words with similar meaning will have similar vector representations. 62 00:03:57,395 --> 00:04:00,750 Usually, the Bag of Words and 63 00:04:00,750 --> 00:04:03,915 Word2vec approaches give quite different results 64 00:04:03,915 --> 00:04:06,740 and can be used together in your solution. 65 00:04:06,740 --> 00:04:09,905 Let's proceed to images now. 66 00:04:09,905 --> 00:04:12,680 Similar to Word2vec for words, 67 00:04:12,680 --> 00:04:17,495 convolutional neural networks can give us a compressed representation of an image. 68 00:04:17,495 --> 00:04:19,755 Let me provide a quick explanation. 69 00:04:19,755 --> 00:04:22,440 When we calculate the network output for an image, 70 00:04:22,440 --> 00:04:25,305 besides getting the output of the last layer, 71 00:04:25,305 --> 00:04:28,275 we also have the outputs of the inner layers. 72 00:04:28,275 --> 00:04:30,690 Here, we will call these outputs descriptors. 73 00:04:30,690 --> 00:04:34,560 Descriptors from later layers 74 00:04:34,560 --> 00:04:38,745 are a better fit for tasks similar to the one the network was trained on. 75 00:04:38,745 --> 00:04:45,015 On the contrary, descriptors from early layers contain more task-independent information. 76 00:04:45,015 --> 00:04:49,235 For example, if your network was trained on the ImageNet data set, 77 00:04:49,235 --> 00:04:50,570 you may successfully use 78 00:04:50,570 --> 00:04:55,483 its last layer representation in some car model classification task. 79 00:04:55,483 --> 00:04:59,270 But if you want to use your network for some medicine-specific task, 80 00:04:59,270 --> 00:05:03,260 you will probably do better if you use an 81 00:05:03,260 --> 00:05:07,500 earlier fully connected layer or even retrain the network from scratch.
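As a sketch of how such descriptors can be pulled out of a pre-trained network, here is one way to do it with Keras and VGG16 (the choice of the 'fc2' layer and the random placeholder batch are assumptions for illustration; for tasks less similar to ImageNet, an earlier layer such as 'fc1' may work better):

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

# VGG16 pre-trained on ImageNet, including the fully connected head.
base = VGG16(weights="imagenet", include_top=True)

# 'fc2' is the last fully connected layer before the 1000-way softmax;
# its output serves as a 4096-dimensional descriptor of the image.
descriptor_model = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

# A batch of images already resized to 224x224 (random placeholder here).
images = np.random.rand(8, 224, 224, 3) * 255.0
descriptors = descriptor_model.predict(preprocess_input(images))
print(descriptors.shape)  # (8, 4096): one descriptor per image
```

These descriptors can then be fed to any standalone model, for example gradient boosting, as ordinary features.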
82 00:05:07,500 --> 00:05:10,910 Here, you may look for a pre-trained model which was 83 00:05:10,910 --> 00:05:15,785 trained on data similar to what you have in the current competition. 84 00:05:15,785 --> 00:05:19,930 Sometimes, we can slightly tune the network to produce 85 00:05:19,930 --> 00:05:25,895 more suitable representations using the target values associated with our images. 86 00:05:25,895 --> 00:05:31,960 In general, this process of tuning a pre-trained model is called fine-tuning. 87 00:05:31,960 --> 00:05:33,935 As in the previous example, 88 00:05:33,935 --> 00:05:36,395 when we are solving some medicine-specific task, 89 00:05:36,395 --> 00:05:39,755 we can fine-tune VGG, ResNet, or 90 00:05:39,755 --> 00:05:45,130 any other pre-trained network and specialize it to solve this particular task. 91 00:05:45,130 --> 00:05:48,830 Fine-tuning, especially for small data sets, 92 00:05:48,830 --> 00:05:50,870 is usually better than training a 93 00:05:50,870 --> 00:05:55,110 standalone model on descriptors or training a network from scratch. 94 00:05:55,110 --> 00:05:59,185 The intuition here is pretty straightforward. 95 00:05:59,185 --> 00:06:02,310 On the one hand, fine-tuning is better than training a 96 00:06:02,310 --> 00:06:05,250 standalone model on descriptors because it 97 00:06:05,250 --> 00:06:08,310 allows us to tune all the network's parameters and 98 00:06:08,310 --> 00:06:12,430 thus extract more effective image representations. 99 00:06:12,430 --> 00:06:15,520 On the other hand, fine-tuning is better than training a 100 00:06:15,520 --> 00:06:18,690 network from scratch if we have too little data, 101 00:06:18,690 --> 00:06:24,270 or if the task we are solving is similar to the task the model was trained on. 102 00:06:24,270 --> 00:06:29,970 In this case, the model can use the knowledge already encoded in the network's parameters, 103 00:06:29,970 --> 00:06:33,950 which can lead to better results and a faster retraining procedure. 104 00:06:33,950 --> 00:06:38,665 Let's discuss the most common scenario of using 105 00:06:38,665 --> 00:06:43,852 fine-tuning, with the example of the online stage of the Data Science Game 2016. 106 00:06:43,852 --> 00:06:50,665 The task was to classify satellite photos of roofs into one of four categories. 107 00:06:50,665 --> 00:06:54,345 As usual, logloss was chosen as the target metric. 108 00:06:54,345 --> 00:06:58,480 Competitors had 8,000 different images. 109 00:06:58,480 --> 00:07:02,145 In this setting, it was a good choice to modify 110 00:07:02,145 --> 00:07:04,090 some pre-trained network to predict 111 00:07:04,090 --> 00:07:06,305 probabilities for these four classes and fine-tune it. 112 00:07:06,305 --> 00:07:09,710 Let's take a look at the 113 00:07:09,710 --> 00:07:16,416 VGG-16 architecture. Because it was trained on the 1,000 classes of ImageNet, 114 00:07:16,416 --> 00:07:19,565 it has an output of size 1,000. 115 00:07:19,565 --> 00:07:21,995 We have only four classes in our task, 116 00:07:21,995 --> 00:07:25,490 so we can remove the last layer with size of 1,000 117 00:07:25,490 --> 00:07:30,410 and put in its place a new one with size of four. 118 00:07:30,410 --> 00:07:34,880 Then, we just retrain our model with a very small learning rate, 119 00:07:34,880 --> 00:07:40,585 usually about 1,000 times smaller than the initial learning rate.
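A minimal Keras sketch of this head-replacement recipe might look like the following; the optimizer choice and the exact learning rate are assumptions (the lecture only says the rate should be roughly 1,000 times smaller than one used for training from scratch), and the commented-out fit call refers to hypothetical training arrays:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Start from VGG16 pre-trained on the 1,000 ImageNet classes.
base = VGG16(weights="imagenet", include_top=True)

# Drop the original 1000-way softmax and attach a new 4-way one.
penultimate = base.get_layer("fc2").output
new_output = Dense(4, activation="softmax", name="roof_classes")(penultimate)
model = Model(inputs=base.input, outputs=new_output)

# Fine-tune all parameters with a learning rate far smaller than the
# one used to train the network from scratch.
model.compile(optimizer=Adam(learning_rate=1e-6),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels_one_hot, epochs=5, batch_size=32)
```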
120 00:07:40,585 --> 00:07:43,901 That is how fine-tuning is done, 121 00:07:43,901 --> 00:07:47,495 but as we already discussed earlier in this video, 122 00:07:47,495 --> 00:07:52,240 we can benefit from using a model pre-trained on a similar data set. 123 00:07:52,240 --> 00:07:56,720 ImageNet by itself consists of very different classes, from 124 00:07:56,720 --> 00:08:01,286 animals to cars, from furniture to food. To find the most suitable pre-trained model, 125 00:08:01,286 --> 00:08:04,936 we could instead take a model trained 126 00:08:04,936 --> 00:08:09,683 on the Places data set, with pictures of buildings and houses, 127 00:08:09,683 --> 00:08:14,025 fine-tune this model, and further improve the result. 128 00:08:14,025 --> 00:08:17,025 If you are interested in the details of fine-tuning, 129 00:08:17,025 --> 00:08:23,265 you can find information about it in almost every neural network library, namely Keras, 130 00:08:23,265 --> 00:08:27,030 PyTorch, Caffe, and others. 131 00:08:27,030 --> 00:08:33,410 Sometimes, you also want to increase the number of training images to train a better network. 132 00:08:33,410 --> 00:08:37,720 In that case, image augmentation may be of help. 133 00:08:37,720 --> 00:08:40,940 Let's illustrate the concept of image augmentation 134 00:08:40,940 --> 00:08:42,260 with the previous example, 135 00:08:42,260 --> 00:08:45,085 where we discussed classification of roof images. 136 00:08:45,085 --> 00:08:51,710 For simplicity, let's imagine that we now have only four images, one for each class. 137 00:08:51,710 --> 00:08:54,430 To increase the number of training samples, 138 00:08:54,430 --> 00:08:59,325 let's start with rotating the images by 180 degrees. 139 00:08:59,325 --> 00:09:02,035 Note that after such a rotation, 140 00:09:02,035 --> 00:09:06,178 an image of class one again belongs to this class because the roof 141 00:09:06,178 --> 00:09:11,210 on the new image also has a north-south orientation. 142 00:09:11,210 --> 00:09:15,215 It is easy to see that the same is true for the other classes. 143 00:09:15,215 --> 00:09:18,905 Great. After doing just one rotation, 144 00:09:18,905 --> 00:09:23,480 we have already doubled the amount of our training data. 145 00:09:23,480 --> 00:09:28,655 Now, what will happen if we rotate an image from the first class by 90 degrees? 146 00:09:28,655 --> 00:09:31,130 What class will it belong to? 147 00:09:31,130 --> 00:09:35,045 Yes, it will belong to the second class, and similarly, 148 00:09:35,045 --> 00:09:40,114 if we rotate images from the third and the fourth classes by 90 degrees, 149 00:09:40,114 --> 00:09:42,570 they will stay in the same class. 150 00:09:42,570 --> 00:09:48,710 Look, we just increased the size of our training set four times. Although 151 00:09:48,710 --> 00:09:51,380 adding such augmentations isn't as 152 00:09:51,380 --> 00:09:55,320 effective as adding brand new images to the training set, 153 00:09:55,320 --> 00:10:00,410 it is still very useful and can boost your score significantly. 154 00:10:00,410 --> 00:10:04,085 In the general case, augmentation of images can include crops, 155 00:10:04,085 --> 00:10:07,240 rotations, adding noise, and so on. 156 00:10:07,240 --> 00:10:10,670 Overall, this reduces overfitting and 157 00:10:10,670 --> 00:10:14,760 allows you to train more robust models with better results. 158 00:10:14,760 --> 00:10:21,370 One last note about extracting vectors from images, and this note is an important one.
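Here is a minimal numpy sketch of this rotation augmentation for the roof example. The class numbering follows the lecture: class 1 (north-south) becomes class 2 (east-west) under a 90-degree rotation and vice versa, classes 3 and 4 are unchanged, and a 180-degree rotation keeps every class; the placeholder images are an assumption:

```python
import numpy as np

# How class labels change under a 90-degree (or 270-degree) rotation.
ROTATE_90_LABEL_MAP = {1: 2, 2: 1, 3: 3, 4: 4}

def augment(images, labels):
    """Return the original samples plus their 90/180/270-degree
    rotations, quadrupling the training set."""
    aug_images, aug_labels = list(images), list(labels)
    for img, lbl in zip(images, labels):
        aug_images.append(np.rot90(img, k=1))        # 90 degrees
        aug_labels.append(ROTATE_90_LABEL_MAP[lbl])
        aug_images.append(np.rot90(img, k=2))        # 180 degrees
        aug_labels.append(lbl)                       # class is preserved
        aug_images.append(np.rot90(img, k=3))        # 270 degrees
        aug_labels.append(ROTATE_90_LABEL_MAP[lbl])
    return np.array(aug_images), np.array(aug_labels)

images = np.random.rand(4, 64, 64, 3)  # one placeholder image per class
labels = [1, 2, 3, 4]
aug_images, aug_labels = augment(images, labels)
print(len(aug_images))  # 16: the training set grew four times
```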
159 00:10:21,370 --> 00:10:26,155 If you want to fine-tune a convolutional neural network or train it from scratch, 160 00:10:26,155 --> 00:10:30,355 you will usually need to use labels of the images in the train set. 161 00:10:30,355 --> 00:10:34,299 So be careful with validation here and do not overfit. 162 00:10:34,299 --> 00:10:39,550 Well then, let's recall the main points we have discussed here. 163 00:10:39,550 --> 00:10:44,950 Sometimes, you have a competition with texts or images as additional data. 164 00:10:44,950 --> 00:10:47,450 In this case, you usually want to extract 165 00:10:47,450 --> 00:10:51,055 useful features from them to improve your model. 166 00:10:51,055 --> 00:10:52,720 When you work with text, 167 00:10:52,720 --> 00:10:55,490 pre-processing can prove to be useful. 168 00:10:55,490 --> 00:10:58,940 This pre-processing can include lowercasing, stemming, lemmatization, 169 00:10:58,940 --> 00:11:02,775 and removing stopwords. 170 00:11:02,775 --> 00:11:05,705 After that pre-processing is done, 171 00:11:05,705 --> 00:11:10,505 you can go either with the Bag of Words or with the Word2vec approach. 172 00:11:10,505 --> 00:11:14,855 Bag of Words guarantees you a clear interpretation: 173 00:11:14,855 --> 00:11:17,130 each feature has a known meaning, at the cost of having 174 00:11:17,130 --> 00:11:21,175 a huge number of features, one for each unique word. 175 00:11:21,175 --> 00:11:24,095 On the other side, Word2vec produces 176 00:11:24,095 --> 00:11:28,665 relatively small vectors, but the meaning of each feature value can be hazy. 177 00:11:28,665 --> 00:11:34,370 The other advantage of Word2vec that is crucial in competitions is that 178 00:11:34,370 --> 00:11:41,060 words with similar meaning will have similar vector representations. 179 00:11:41,060 --> 00:11:46,400 Also, N-grams can be applied to include word interactions in texts, and 180 00:11:46,400 --> 00:11:52,880 TF-iDF can be applied to post-process the matrices produced by Bag of Words. 181 00:11:52,880 --> 00:11:55,830 Now, images. For images, 182 00:11:55,830 --> 00:12:00,825 we can use pre-trained convolutional neural networks to extract features. 183 00:12:00,825 --> 00:12:03,360 Depending on the similarity between 184 00:12:03,360 --> 00:12:07,980 the competition data and the data the neural network was trained on, 185 00:12:07,980 --> 00:12:11,765 we may want to calculate descriptors from different layers. 186 00:12:11,765 --> 00:12:18,065 Often, fine-tuning of the neural network can help improve the quality of the descriptors. 187 00:12:18,065 --> 00:12:21,120 For the purpose of effective fine-tuning, 188 00:12:21,120 --> 00:12:25,095 we may want to augment our data. 189 00:12:25,095 --> 00:12:29,490 Also, fine-tuning and data augmentation are often used in 190 00:12:29,490 --> 00:12:35,230 competitions where we have no other data except images. 191 00:12:35,230 --> 00:12:38,750 Besides, there are a number of pre-trained models for 192 00:12:38,750 --> 00:12:43,385 convolutional neural networks and Word2vec on the internet. 193 00:12:43,385 --> 00:12:46,640 Great. Now, you know how to handle 194 00:12:46,640 --> 00:12:51,245 competitions with additional data like text and images. 195 00:12:51,245 --> 00:12:54,800 By applying and adapting the ideas we have discussed, 196 00:12:54,800 --> 00:12:59,890 you will be able to gain an edge in this kind of setting.