Let's start to formalize the problem of learning a good word embedding. When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix. Let's take a look at what that means.

Let's say, as usual, we're using our 10,000-word vocabulary. So the vocabulary has A, Aaron, Orange, Zulu, and maybe also an unknown word token. What we're going to do is learn an embedding matrix E, which is going to be a 300 by 10,000 dimensional matrix if you have a 10,000-word vocabulary, or maybe 300 by 10,001 if you count the unknown word token as one extra token. The columns of this matrix are the different embeddings for the 10,000 different words in your vocabulary.

So, Orange was word number 6257 in our vocabulary of 10,000 words. One piece of notation we'll use is that o_6257 is the one-hot vector with zeros everywhere and a one in position 6257. So this will be a 10,000-dimensional vector with a one in just one position. This isn't quite drawn to scale; it should be as tall as the embedding matrix on the left is wide.

And if the embedding matrix is called capital E, then notice that if you take E and multiply it by the one-hot vector o_6257, this will be a 300-dimensional vector. E is 300 by 10,000 and o_6257 is 10,000 by 1, so the product will be 300 by 1, a 300-dimensional vector. And notice that to compute the first element of this 300-dimensional vector, you multiply the first row of the matrix E with the one-hot vector. But all of those elements are zero except for element 6257, and so you end up with zero times this, zero times this, and so on, then one times whatever that element is, and zero times everything else. So the first element ends up being whatever sits in the first row of E under the Orange column. And then, for the second element of this 300-dimensional vector, you would take the vector o_6257 and multiply it by the second row of the matrix E.
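To make the column-selection argument concrete, here is a minimal NumPy sketch, assuming the same 300 by 10,000 shape; the matrix values and the index 6257 are just illustrative stand-ins:

```python
import numpy as np

vocab_size = 10_000   # size of the vocabulary
embed_dim = 300       # dimension of each word embedding

# Illustrative embedding matrix E: 300 x 10,000 (random stand-in values)
E = np.random.randn(embed_dim, vocab_size)

# One-hot vector o_6257: zeros everywhere, a one in position 6257
o_6257 = np.zeros((vocab_size, 1))
o_6257[6257] = 1.0

# Matrix-vector product: (300 x 10,000) @ (10,000 x 1) -> (300 x 1)
e_6257 = E @ o_6257

# The product is exactly column 6257 of E
assert np.allclose(e_6257[:, 0], E[:, 6257])
```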
So again, you have zero times this, plus zero times this, plus zero times all of those elements, then one times this element, then zero times everything else, and you add that all together. So you end up with that element of the second row under the Orange column, and so on as you go down the rest of this column. That's why the embedding matrix E times this one-hot vector winds up selecting out the 300-dimensional column corresponding to the word Orange.

So, this is going to be equal to e_6257, which is the notation we're going to use to represent the embedding vector, that 300 by 1 dimensional vector, for the word Orange. And more generally, e_w is going to be the embedding for a specific word w. And more generally still, E times o_j, the one-hot vector with a one in position j, is going to be e_j, the embedding for word j in the vocabulary.

So, the thing to remember from this slide is that our goal will be to learn an embedding matrix E. What you'll see in the next video is that you initialize E randomly and then use gradient descent to learn all the parameters of this 300 by 10,000 dimensional matrix, and E times the one-hot vector gives you the embedding vector.

Now, just one note: when we're writing the equations, it'll be convenient to write this type of notation where you take the matrix E and multiply it by the one-hot vector o. But when you're implementing this, it is not efficient to actually implement it as a matrix-vector multiplication, because the one-hot vector is a relatively high-dimensional vector and most of its elements are zero; you would be multiplying a whole bunch of things by zero. So in practice, you would actually use a specialized function to just look up a column of the matrix E rather than doing this with a matrix multiplication. But when writing out the math, it is just convenient to write it this way. In Keras, for example, there is an Embedding layer, and when you use the Embedding layer it more efficiently just pulls out the column you want from the embedding matrix rather than doing it with a much slower matrix-vector multiplication.
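As a rough sketch of that difference, the lookup amounts to plain column indexing, and in Keras the Embedding layer does the equivalent lookup by word index. The sizes and the index 6257 below are illustrative, and note that Keras stores its weights with one row per word, the transpose of the 300 by 10,000 convention used in this video:

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10_000, 300

# Direct lookup: instead of computing E @ o_6257, just index column 6257 of E
E = np.random.randn(embed_dim, vocab_size)
e_6257 = E[:, 6257]   # copies 300 numbers, no 10,000-long multiply-and-add

# Keras Embedding layer: feed it word indices, get back embedding vectors.
# Its weight matrix has shape (vocab_size, embed_dim), i.e. one row per word.
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
word_indices = tf.constant([[6257]])   # a batch containing one word index
e_orange = embedding(word_indices)     # shape (1, 1, 300)
```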
So, in this video you saw the notation used to describe algorithms for learning these embeddings, and the key piece of terminology is this matrix capital E, which contains all the embeddings for the words in the vocabulary. In the next video, we'll start to talk about specific algorithms for learning this matrix E. Let's go on to the next video.