In this video, I'm going to talk about the use of binary codes for image retrieval. For retrieving documents, people like Google already have such good methods that techniques like semantic hashing may not be of much value. But retrieving images is much more difficult, and methods that convert an image into a fairly large binary code of, say, 256 bits seem to work quite well. However, we don't want to do a very long sequential search through vectors of 256 bits, so semantic hashing can be used to first create a short list, and then we can get better-quality matches by using the longer binary codes in a serial search.

So now we get to look at using binary codes for image retrieval. Image retrieval at present is typically done by using the captions, but why not use the images too? They obviously contain a lot more information than the captions. The basic problem is that pixels are not like words: individual pixels don't tell us much about the content of an image. Obviously, if we could recognize the objects in the images, then we'd have things that were much more like words. But recognizing objects is hard, at least it was when we first did this work. Deep neural nets have now got much better at it, and so that may well be the way to go.

If we're not going to recognize the objects, maybe what we should do is extract a vector that has information about the content of the image, and the obvious thing to extract is a real-valued vector. The problem is that matching real-valued vectors in a big database is slow, and it also requires a lot of storage. If we can extract a fairly short binary vector that contains a lot of information about the image, that's much easier to store and much faster to match. Even faster is to use a two-stage method. First we extract a short binary code of about 30 bits, and that short binary code is used with semantic hashing to very rapidly get a short list of promising images: we simply take the short binary code and flip a few bits in it to get candidate images. The candidate images can then be matched using 256-bit binary codes that are stored with each known image, to find much better matches than can be found with the short binary code alone. Even a 256-bit binary code only requires four words of memory per image, and even though we then have to do a serial search on these binary codes, the search can be done very fast: it only takes a few machine operations to compare two 256-bit binary codes and find out how many bits they have in common.
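To make the two-stage search concrete, here is a minimal sketch in Python of how it might be implemented. It is not code from the lecture: the data layout (a hash table keyed by a roughly 30-bit address, plus a 256-bit code stored as four 64-bit words per image) and the function names are illustrative assumptions.

    from itertools import combinations

    def candidates(memory, query_address, n_bits=30, max_flips=2):
        # Semantic hashing stage: probe the query's 30-bit address and all
        # addresses within a small Hamming ball (a few flipped bits).
        ids = set(memory.get(query_address, []))
        for r in range(1, max_flips + 1):
            for bits in combinations(range(n_bits), r):
                probe = query_address
                for b in bits:
                    probe ^= 1 << b            # flip bit b of the address
                ids.update(memory.get(probe, []))
        return ids

    def hamming(code_a, code_b):
        # Number of differing bits between two 256-bit codes,
        # each stored as four 64-bit integers.
        return sum(bin(a ^ b).count("1") for a, b in zip(code_a, code_b))

    def retrieve(memory, long_codes, query_address, query_code, k=10):
        # Stage 1: hash-table probes give a short list of candidate image ids.
        # Stage 2: a serial search ranks the candidates by 256-bit Hamming distance.
        short_list = candidates(memory, query_address)
        return sorted(short_list, key=lambda i: hamming(long_codes[i], query_code))[:k]

Here memory maps each short-code address to the list of image ids stored there, and long_codes maps an image id to its 256-bit code; both names are hypothetical.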
The question is, how good is a binary code of that size at retrieving images? Is it going to find images that we think of as being similar?

So here's a net trained by Alex Krizhevsky. It works on small colour images, only 32 by 32 pixels, and it takes as input the red, green, and blue channels of those images, so 3,072 inputs. It then expands that to a larger number of hidden units, because we're going from real-valued inputs to logistic hidden units, which probably have less capacity. We then progressively decrease the number of units in each layer until we get down to 256 bits. This encoder has about 67 million parameters; it's quite big, it takes a few days to train on an Nvidia GPU, and he trained it on two million images. There's absolutely no theory to justify the architecture we used. We knew we wanted a fairly deep net, and it makes sense to make it get narrower as we go up, but this particular architecture, the number of units in each layer, is just a guess. The interesting thing is that a guess like this already works quite well, and presumably there are other architectures that would work better.

The first question to ask is how well an autoencoder like this does at reconstructing the images. Here is a face image and its reconstruction, and you can see that from the reconstruction you can tell what kind of image it is. Here's another example, a scene that is probably at a party; you can't really tell what kind of scene it is from the reconstruction, but you might guess that there are a number of people involved, or you might not. Here's an outdoor scene, and you can see that the reconstruction captures a lot of information about the image: it captures the water and the sky, and the thin strip of land in between.
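Before moving on to retrieval results, here is a rough PyTorch sketch of the kind of encoder just described. The intermediate layer widths are assumptions chosen only to show an architecture that first expands and then narrows to 256 logistic code units; they are not claimed to match Krizhevsky's actual net.

    import torch
    import torch.nn as nn

    # Encoder half of the autoencoder: 32x32x3 = 3,072 real-valued inputs,
    # expanded to a wider layer of logistic units, then narrowed to a 256-unit code.
    widths = [3072, 8192, 4096, 2048, 1024, 512, 256]     # intermediate widths are guesses
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(n_in, n_out), nn.Sigmoid()]  # logistic hidden units
    encoder = nn.Sequential(*layers)

    def binary_code(flat_images):
        # Threshold the 256 logistic code units at 0.5 to get one binary code per image.
        with torch.no_grad():
            return (encoder(flat_images) > 0.5).to(torch.uint8)

With these assumed widths the encoder has on the order of the tens of millions of parameters mentioned in the lecture, but the exact sizes are not specified there.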
So let's look at the quality of retrieval we can get with an autoencoder that gives those kinds of reconstructions. We start with a picture of Michael Jackson, shown in the red square, and Alex retrieved the most similar images; above each image you can see by how many bits its code differed from Michael Jackson's, and you'll notice they're all a fairly similar number of bits. In 256 bits, differing by only 61 bits is extraordinarily unlikely to happen by chance if they were random images; an image has to be pretty similar to differ by so few bits. One nice thing about what's retrieved is that, with one exception, they're all faces. If we look at the retrieval you get by using Euclidean distance on the raw pixels, then some of them are faces, but most of them aren't. So obviously the autoencoder has understood something about faces that isn't conveyed by Euclidean distance on pixels, and it's clearly giving much better retrieval.

Let's take another example. Here we took the image of a party scene and retrieved similar images, and you can see that about half of them are images you would think of as fairly similar: they are other party scenes. Typically the other party scenes have something bright in the middle, like the original party scene, and you'll also notice that most of the bad matches also have something bright in the middle. So even though we're getting down to 256-bit binary codes through a lot of hidden layers, the code is still sensitive to quite a lot of the image structure and to where the brighter patches are. If you look at what Euclidean distance does, it's much worse: it gets one other scene with a group of people, and then everything else is fairly dissimilar. You'll notice that Euclidean distance often retrieves very smooth images. That's because, if you can't match the high-frequency variation in an image, it's better to match its average than to get other high-frequency variation that is out of phase. So when you give it a complicated image, Euclidean distance will typically find smooth images to match it, because it's minimizing a squared error in pixel space.

Obviously, we'd like image retrieval to be more sensitive to the content of the image, that is, to the kinds of objects in the image and their relationships, and less sensitive to the pixel intensities. We can do that by first training a big net to recognize lots of different kinds of objects in real images, as we showed how to do in lecture five. Then we take the activity vector in the last hidden layer of that big net and use it as a representation of the image. This should be much better than the pixel intensities at capturing information about the kinds of objects in the image.
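As a stand-in for the recognition net from lecture five, here is a hedged sketch of that idea. It uses a pretrained AlexNet from a recent version of torchvision, which is an assumption rather than the course's exact net, takes the activity vector of the last hidden layer, and ranks stored images by Euclidean distance in that feature space.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Pretrained ImageNet classifier as a stand-in for the recognition net,
    # with the final 1000-way output layer removed so it returns the activity
    # vector of the last hidden layer (4,096 units for this particular net).
    model = models.alexnet(weights="IMAGENET1K_V1").eval()
    feature_net = nn.Sequential(
        model.features, model.avgpool, nn.Flatten(),
        *list(model.classifier.children())[:-1],
    )

    def activity_vectors(image_batch):
        # Last-hidden-layer activity vectors for a batch of preprocessed images.
        with torch.no_grad():
            return feature_net(image_batch)

    def nearest_by_euclidean(query_vec, stored_vecs, k=10):
        # Rank stored images by Euclidean distance between activity vectors.
        dists = torch.cdist(query_vec.unsqueeze(0), stored_vecs).squeeze(0)
        return torch.topk(dists, k, largest=False).indices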
To see whether this approach is likely to work, we used the net described in lecture five, the one that won the ImageNet competition. So far we've only tried it with Euclidean distance between the activity vectors in the last hidden layer, but obviously, if that works, we could then take those activity vectors and build an autoencoder on them to get them down to short binary codes. So let's first see whether it works with Euclidean distance. It turns out it works really well; we don't know yet whether it will work with binary codes.

In the column on the left you see the query images, and to the right of them you see the things that were retrieved. If you look at the elephant query image, you'll see that what gets retrieved is other elephants, but elephants in very different poses, so those images wouldn't have a very good overlap in pixel space, although the overlap wouldn't be that bad. If you look at the Halloween pumpkins, you'll see that all the retrieved images are other Halloween pumpkins, and some of them would have a pretty bad overlap in pixel space. Similarly with the aircraft carrier: we retrieve other images of aircraft carriers that look very different. So we anticipate that if we could reduce the activity vector to a short binary code, we would have a fast and effective way of retrieving similar images just by the content of the image. We'll see in lecture sixteen that we could actually combine the content of the image with the caption of the image to get an even better representation.
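Since the lecture only anticipates that final step of reducing the activity vectors to short binary codes, the following is purely a sketch of one way it might be done: train a small autoencoder on the 4,096-dimensional activity vectors and threshold its 256 logistic code units. The layer sizes and training details are assumptions, not the method used in the course.

    import torch
    import torch.nn as nn

    # Hypothetical follow-up: compress 4,096-d activity vectors to 256-bit codes
    # with a small autoencoder, so the hashing machinery sketched earlier can be reused.
    enc = nn.Sequential(nn.Linear(4096, 1024), nn.Sigmoid(),
                        nn.Linear(1024, 256), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(256, 1024), nn.Sigmoid(),
                        nn.Linear(1024, 4096))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

    def train_step(activity_batch):
        # One reconstruction step on a batch of stored activity vectors.
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(activity_batch)), activity_batch)
        loss.backward()
        opt.step()
        return loss.item()

    def short_codes(activity_batch):
        # 256-bit binary codes: threshold the logistic code units at 0.5.
        with torch.no_grad():
            return (enc(activity_batch) > 0.5).to(torch.uint8)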