So that was one idea of using an inner product to compute a distance, but here's another really natural inner-product measure that we could use. And this is simply to look at the inner product between one article, which might be our query article xq, and another article xi: we multiply the two vectors element-wise, value by value, add the products up, and call that the similarity. So it's the inner product between xi and xq, which is simply the sum over all dimensions of the products of the corresponding values in the two vectors. We can think of this as measuring how much these articles overlap in terms of the vocabulary used, and how heavily weighted that overlap is.

Okay, so in this example it would give us a similarity of 13 between these two different articles about soccer. But if we looked at an article about soccer relative to an article about some world news event, then maybe there would be very little overlap. Actually, in this case there is no overlap at all in the vocabularies of these articles, so the similarity would be 0.

Now, the similarity that we talked about on the previous slide, where we just summed up the products of the different features, is very closely related to a popular similarity metric called cosine similarity. It looks exactly the same as what we had before, but in cosine similarity you divide by two additional terms, and I think it's a little more straightforward to write it as follows: each of those terms is just normalizing one of the vectors. So we sum the squares of the elements just within vector xi, and just within vector xq, and take the square root; by definition, that's the norm, which is what the double bars denote, the magnitude of the vector. And we can rewrite this further as xi divided by its norm, transposed, times xq divided by its norm. So relative to the example we had on the past two slides, instead of just computing the similarity as the inner product between the raw vectors, we're first going to normalize the vectors.
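To make this concrete, here's a minimal sketch in Python of both measures. The word counts are hypothetical (the actual vectors from the slide aren't reproduced in the transcript); they're chosen so the raw inner product comes out to 13, matching the soccer example.

```python
import math

def inner_product(x, y):
    """Unnormalized similarity: sum of element-wise products."""
    return sum(xj * yj for xj, yj in zip(x, y))

def cosine_similarity(x, y):
    """Inner product of the two vectors after dividing each by its norm."""
    norm_x = math.sqrt(sum(xj ** 2 for xj in x))
    norm_y = math.sqrt(sum(yj ** 2 for yj in y))
    return inner_product(x, y) / (norm_x * norm_y)

# Hypothetical word-count vectors over a shared vocabulary,
# chosen so the soccer-soccer inner product is 13.
soccer_1 = [1, 0, 3, 2]
soccer_2 = [2, 0, 1, 4]
world_news = [0, 5, 0, 0]   # no vocabulary overlap with soccer_1

print(inner_product(soccer_1, soccer_2))      # 13
print(inner_product(soccer_1, world_news))    # 0
print(cosine_similarity(soccer_1, soccer_2))  # ~0.76
```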
This normalization is a really, really critical difference, actually, which we'll discuss more.

Okay, and you can show that what we're doing here is equivalent to just looking at the angle between the two vectors, regardless of their magnitudes. The reason this normalized inner product is equivalent to the cosine of the angle between the vectors comes straightforwardly from the definition of an inner product: we know that A transpose B equals the magnitude of vector A, times the magnitude of vector B, times the cosine of the angle between A and B.

So if we have two different points here, two different articles, let's say there's just a vocabulary with two words, word one and word two. Then this is maybe the word-count vector for one article, and this is the word-count vector for the other article. What cosine similarity is doing is just looking at the cosine of the angle between the two vectors, regardless of their magnitudes.

Now, I want to highlight a couple of things about cosine similarity. One is the fact that it's not actually a proper distance metric, the way Euclidean distance is, because the triangle inequality doesn't hold. But it's also important to know that it's extremely efficient to compute for sparse vectors, because you only need to consider the nonzero elements when performing the calculation (there's a sketch of this below).

Okay, but now I want to run through an example of what I mean by this normalization, just to make sure it's very clear to everyone. So here's our standard word-count vector, or maybe TF-IDF vector, and when we normalize it, we're simply dividing by the square root of the sum of the squared counts in the vector. And if we do out this calculation, our normalized representation of this document would be as follows (see the second sketch below).

Okay, so let's talk a little bit more about this cosine similarity metric, and think about the values that it can take. Let's say there are two really, really similar articles, so they have a very small angle theta between them. Then the cosine of theta is going to be approximately equal to, you guys remember, 1, as you would hope: high similarity if they're very close together.
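Here's a minimal sketch of the sparsity point above, assuming articles are stored as dictionaries mapping words to their nonzero counts; the dot product then only touches words that appear in both articles.

```python
import math

def sparse_cosine(x, y):
    """Cosine similarity for sparse vectors stored as {word: count} dicts.
    Only nonzero entries are ever touched."""
    # Iterate over the smaller dict; words absent from either contribute 0.
    if len(x) > len(y):
        x, y = y, x
    dot = sum(cnt * y[word] for word, cnt in x.items() if word in y)
    norm_x = math.sqrt(sum(c ** 2 for c in x.values()))
    norm_y = math.sqrt(sum(c ** 2 for c in y.values()))
    return dot / (norm_x * norm_y)

a = {"goal": 3, "team": 2}
b = {"goal": 1, "match": 4}
print(sparse_cosine(a, b))  # only "goal" contributes to the dot product
```

Iterating over the smaller dictionary keeps the cost proportional to the number of nonzero entries rather than the vocabulary size, which is what makes this cheap for documents.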
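And here's the normalization step from a moment ago as code, with a hypothetical count vector standing in for the one on the slide.

```python
import math

def normalize(x):
    """Divide each count by the vector's norm
    (the square root of the sum of squared counts)."""
    norm = math.sqrt(sum(xj ** 2 for xj in x))
    return [xj / norm for xj in x]

counts = [3, 0, 4]          # hypothetical counts; norm = sqrt(9 + 16) = 5
print(normalize(counts))    # [0.6, 0.0, 0.8] -- a unit-length vector
```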
So let's just remember how we think about cosine by drawing the unit circle, a circle with radius 1. If we're looking at some angle theta, then this vertical length here is sine of theta, and this horizontal length here is cosine of theta. So as we're walking around the circle: when the angle is 0, we see that cosine of theta is 1. When we get up to pi over 2, or 90 degrees, so vertical, the distance along the x axis has dropped to 0, so cosine of theta is 0. And when we shift over to pi, or 180 degrees, we see that cosine of theta is -1.

Okay, so now that we've reviewed cosine a little bit, let's go back to these drawings here. This one was supposed to be roughly 90 degrees (it's a little bit more), and when we look at cosine of theta here, we know that it's approximately 0. And in this last case, cosine of theta is going to be approximately -1.

So in general, cosine similarity can range from -1 to 1. But if we restrict ourselves to having just nonnegative features, like we would if we were looking at a TF-IDF vector for a document, we could never have that last example. We're always going to be living in the positive quadrant, so our angles are going to range from 0 to 90 degrees, and our cosine similarity is going to range from 0 to 1. Okay, so this is going to be our focus.

And in these cases, the way we're going to define a distance is simply one minus the similarity. And remember, it's not a proper distance according to the formal definition of a distance metric, but we can use it as a measure of distance between articles.
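Finally, a self-contained sketch of the one-minus-similarity distance just described, reusing the same hypothetical soccer and world-news count vectors from the first example.

```python
import math

def cosine_similarity(x, y):
    """Inner product of unit-normalized vectors (as in the earlier sketch)."""
    dot = sum(xj * yj for xj, yj in zip(x, y))
    return dot / (math.sqrt(sum(v ** 2 for v in x)) *
                  math.sqrt(sum(v ** 2 for v in y)))

def cosine_distance(x, y):
    """1 - cosine similarity. For nonnegative vectors this lies in [0, 1],
    but it is not a proper distance metric: the triangle inequality can fail."""
    return 1.0 - cosine_similarity(x, y)

print(cosine_distance([1, 0, 3, 2], [1, 0, 3, 2]))  # ~0.0: identical articles
print(cosine_distance([1, 0, 3, 2], [0, 5, 0, 0]))  # 1.0: no vocabulary overlap
```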