Let's start to formalize the problem of learning a good word embedding. When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix. Let's take a look at what that means.

Let's say, as usual, we're using our 10,000-word vocabulary. So the vocabulary has A, Aaron, Orange, Zulu, and maybe also an unknown word token. What we're going to do is learn an embedding matrix E, which is going to be a 300 by 10,000 dimensional matrix if you have a 10,000-word vocabulary, or maybe 300 by 10,001 if you count the unknown word token as one extra token. The columns of this matrix are the different embeddings for the 10,000 different words in your vocabulary.

So, Orange was word number 6257 in our vocabulary of 10,000 words. One piece of notation we'll use is that o_6257 is the one-hot vector with zeros everywhere and a one in position 6257. So this will be a 10,000-dimensional vector with a one in just one position. This isn't quite drawn to scale; it should be as tall as the embedding matrix on the left is wide.

And if the embedding matrix is called capital E, then notice that if you take E and multiply it by the one-hot vector o_6257, this will be a 300-dimensional vector. E is 300 by 10,000 and o_6257 is 10,000 by 1, so the product will be 300 by 1, a 300-dimensional vector. And notice that to compute the first element of this 300-dimensional vector, you multiply the first row of the matrix E with the one-hot vector. But all of those elements are zero except for element 6257, and so you end up with zero times this, zero times this, and so on, then one times whatever that element is, and zero times everything else. So the first element ends up being whatever sits in the first row of E under the Orange column. And then, for the second element of this 300-dimensional vector, you would take the vector o_6257 and multiply it by the second row of the matrix E.
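To make the column-selection argument concrete, here is a minimal NumPy sketch, assuming the same 300 by 10,000 shape; the matrix values and the index 6257 are just illustrative stand-ins:

```python
import numpy as np

vocab_size = 10_000   # size of the vocabulary
embed_dim = 300       # dimension of each word embedding

# Illustrative embedding matrix E: 300 x 10,000 (random stand-in values)
E = np.random.randn(embed_dim, vocab_size)

# One-hot vector o_6257: zeros everywhere, a one in position 6257
o_6257 = np.zeros((vocab_size, 1))
o_6257[6257] = 1.0

# Matrix-vector product: (300 x 10,000) @ (10,000 x 1) -> (300 x 1)
e_6257 = E @ o_6257

# The product is exactly column 6257 of E
assert np.allclose(e_6257[:, 0], E[:, 6257])
```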
So again, you have zero times this, plus zero times this, plus zero times all of those elements, then one times this element, then zero times everything else, and you add that all together. So you end up with that element of the second row under the Orange column, and so on as you go down the rest of this column. That's why the embedding matrix E times this one-hot vector winds up selecting out the 300-dimensional column corresponding to the word Orange.

So, this is going to be equal to e_6257, which is the notation we're going to use to represent the embedding vector, that 300 by 1 dimensional vector, for the word Orange. And more generally, e_w is going to be the embedding for a specific word w. And more generally still, E times o_j, the one-hot vector with a one in position j, is going to be e_j, the embedding for word j in the vocabulary.

So, the thing to remember from this slide is that our goal will be to learn an embedding matrix E. What you'll see in the next video is that you initialize E randomly and then use gradient descent to learn all the parameters of this 300 by 10,000 dimensional matrix, and E times the one-hot vector gives you the embedding vector.

Now, just one note: when we're writing the equations, it'll be convenient to write this type of notation where you take the matrix E and multiply it by the one-hot vector o. But when you're implementing this, it is not efficient to actually implement it as a matrix-vector multiplication, because the one-hot vector is a relatively high-dimensional vector and most of its elements are zero; you would be multiplying a whole bunch of things by zero. So in practice, you would actually use a specialized function to just look up a column of the matrix E rather than doing this with a matrix multiplication. But when writing out the math, it is just convenient to write it this way. In Keras, for example, there is an Embedding layer, and when you use the Embedding layer it more efficiently just pulls out the column you want from the embedding matrix rather than doing it with a much slower matrix-vector multiplication.
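As a rough sketch of that difference, the lookup amounts to plain column indexing, and in Keras the Embedding layer does the equivalent lookup by word index. The sizes and the index 6257 below are illustrative, and note that Keras stores its weights with one row per word, the transpose of the 300 by 10,000 convention used in this video:

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10_000, 300

# Direct lookup: instead of computing E @ o_6257, just index column 6257 of E
E = np.random.randn(embed_dim, vocab_size)
e_6257 = E[:, 6257]   # copies 300 numbers, no 10,000-long multiply-and-add

# Keras Embedding layer: feed it word indices, get back embedding vectors.
# Its weight matrix has shape (vocab_size, embed_dim), i.e. one row per word.
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
word_indices = tf.constant([[6257]])   # a batch containing one word index
e_orange = embedding(word_indices)     # shape (1, 1, 300)
```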
So, in this video you saw the notation used to describe algorithms for learning these embeddings, and the key piece of terminology is this matrix capital E, which contains all the embeddings for the words in the vocabulary. In the next video, we'll start to talk about specific algorithms for learning this matrix E. Let's go on to the next video.