In this video, I'm going to talk about the use of binary codes for image retrieval. For retrieving documents, people like Google already have such good methods that techniques like semantic hashing may not be of much value. But retrieving images is much more difficult, and methods that convert an image into a fairly large binary code of, say, 256 bits seem to work quite well. However, we don't want to do a very long sequential search through vectors of 256 bits, so semantic hashing can be used to first create a short list, and then we can get better-quality matches by using the longer binary codes in a serial search.

So let's look at using binary codes for image retrieval. Image retrieval at present is typically done by using the captions. But why not use the images too? They obviously contain a lot more information than the captions. The basic problem is that pixels are not like words: individual pixels don't tell us much about the content of an image. Obviously, if we could recognize the objects in the images, then we'd have things that were much more like words. But recognizing objects is hard. At least it was when we first did this work; deep neural nets have now got much better at it, and so that may well be the way to go.

So if we're not going to recognize the objects, maybe what we should do is extract a vector that has information about the content of the image. The obvious thing to extract is a real-valued feature vector, but the problem is that matching real-valued vectors in a big database is slow, and it also requires a lot of storage. If we can extract a fairly short binary vector that contains a lot of information about the image, that's much easier to store and much faster to match.

Even faster is to use a two-stage method. First, we extract a short binary code of about 30 bits, and that short code is used with semantic hashing to very rapidly get us a short list of promising images: we simply take the short binary code and flip a few bits in it to get candidate images. The candidates can then be matched using the 256-bit binary codes that are stored with each known image, to find much better matches than can be found with the 30-bit code alone. Even a 256-bit binary code only requires four words of memory per image, and even though we're then going to do a serial search on these binary codes, the search can be done very fast, because it only takes a few machine operations to compare two 256-bit codes and find out how many bits they have in common.

The question is, how good is a binary code of that size at retrieving images? Is it going to find images that we think of as being similar?

So here's a net that was trained by Alex Krizhevsky. It works on small color images, only 32 pixels by 32 pixels, and it takes as input the red, green, and blue channels of those images, so 3,072 inputs. It then expands that to a larger number of hidden units, because we're going from real-valued inputs to logistic hidden units, which probably have less capacity, and then progressively decreases the number of units in each layer until we get down to 256 bits. This encoder has about 67 million parameters; it's quite big. It takes a few days to train on an Nvidia GPU, and it was trained on two million images. There's absolutely no theory to justify the architecture we used. We know we want a fairly deep net, and it makes sense to make it get narrower as we go up.
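To make the shape of that encoder concrete, here is a minimal sketch in PyTorch. Only the 3,072 inputs, the logistic (sigmoid) units, and the 256-unit code layer come from the description above; the intermediate layer widths are illustrative guesses, not the sizes Krizhevsky actually used.

```python
# A minimal sketch of this kind of deep autoencoder (layer widths other than 3072 and
# 256 are illustrative guesses, not the ones from the lecture).
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: expand first (real-valued pixels -> logistic units), then narrow down.
        sizes = [3072, 8192, 4096, 1024, 512, 256]
        enc = []
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            enc += [nn.Linear(n_in, n_out), nn.Sigmoid()]
        self.encoder = nn.Sequential(*enc)
        # Decoder: a mirror of the encoder, reconstructing the 3072 pixel values.
        dec = []
        for n_in, n_out in zip(sizes[::-1][:-1], sizes[::-1][1:]):
            dec += [nn.Linear(n_in, n_out), nn.Sigmoid()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        code = self.encoder(x)      # 256 logistic units
        recon = self.decoder(code)
        return code, recon

model = DeepAutoencoder()
x = torch.rand(4, 3072)             # a batch of 4 flattened 32x32 RGB images in [0, 1]
code, recon = model(x)
bits = (code > 0.5).int()           # threshold the top-layer activities to get the binary code
```

Thresholding the 256 top-layer activities at 0.5 is what turns the real-valued code into the binary code used for retrieval.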
But this particular architecture, with these particular numbers of units per layer, is just a guess. The interesting thing is that a guess like this already works quite well, and presumably there are other architectures that would work better.

The first question to ask is: how well does an autoencoder like this do at reconstructing the images? Here is a face image and its reconstruction, and you can see that from the reconstruction you can tell what kind of image it is. Here's another example, a scene probably at a party. You can't really tell what kind of scene it is from the reconstruction, but you might guess that there are a number of people involved, or you might not. Here's an outdoor scene, and you can see that the reconstruction captures a lot of information about the image: it captures the water and the sky and the thin strip of land on the horizon.

So let's look at the quality of retrieval we can do with an autoencoder that gives those kinds of reconstructions. We'll start with the picture of Michael Jackson shown in the red square. Alex retrieved the most similar images, and above each image you can see by how many bits its code differed from the Michael Jackson code. You'll notice they all differ by a fairly similar number of bits. Out of 256 bits, differing by only 61 bits is extraordinarily unlikely to happen by chance for random images; an image has to be pretty similar to differ by so few bits. One nice thing about what's retrieved is that, with one exception, they're all faces. If we look at the retrieval you get by using Euclidean distance on the raw pixels, then some of them are faces, but most of them aren't. So obviously the autoencoder has understood something about faces that isn't captured by Euclidean distance; it's clearly giving much better retrieval.

Let's take another example. Here we took the image of a party scene and retrieved images, and you can see that about half of them are images you would think of as fairly similar, that is, other party scenes. They tend to be party scenes with something bright in the middle, like the original party scene, and you'll also notice that most of the bad matches also have something bright in the middle. So even though we're getting down to 256-bit binary codes through a lot of hidden layers, the code is still sensitive to quite a lot of the image structure and to where the bright patches are. If you look at what Euclidean distance does, it's much worse: it gets one other scene with a group of people, and then everything else is fairly dissimilar. You'll notice that with Euclidean distance you often get very smooth images. That's because if you can't match the high-frequency variation in an image, it's better to match its average than to match high-frequency variation that is out of phase. So when you've got a complicated image, Euclidean distance will typically find smooth images to match it, because it's minimizing the squared error in pixel space.

So obviously we'd like the image retrieval to be more sensitive to the content of the image, that is, to the kinds of objects in it and their relationships, and less sensitive to the pixel intensities. We can do that by first training a big net to recognize lots of different kinds of objects in real images, as described in lecture five. Then we take the activity vector in the last hidden layer of the big net and use that as a representation of the image.
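As a rough illustration of that last step, here is a sketch that pulls out the last hidden layer of a pretrained recognition net. The lecture uses Krizhevsky's own ImageNet net; torchvision's AlexNet is only a stand-in here, and the file names are hypothetical.

```python
# A sketch of using the last hidden layer of a trained object-recognition net as an
# image descriptor. Torchvision's AlexNet stands in for the lecture's ImageNet net.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()
# Drop the final classification layer so the output is the last hidden layer's
# activities (a 4096-dimensional vector) rather than class scores.
net.classifier = torch.nn.Sequential(*list(net.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def describe(path):
    """Return the last-hidden-layer activity vector for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return net(img).squeeze(0)      # shape: (4096,)

# Retrieval by Euclidean distance between activity vectors (hypothetical file names):
# dist = torch.norm(describe("query.jpg") - describe("candidate.jpg"))
```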
That activity vector should be much better than the pixel intensities at capturing information about the kinds of objects in the image. So, to see whether this approach is likely to work, we used the net described in lecture five, the one that won the ImageNet competition. So far, we've only tried it with Euclidean distance between the activity vectors in the last hidden layer. But obviously, if that works, we could then take those activity vectors and build an autoencoder on them to get them down to short binary codes. So let's first see if it works with Euclidean distance. It turns out it works really well; we don't know yet whether it will work with binary codes.

In the column on the left, you see the query images, and to the right of them you see the images that were retrieved. If you look at the elephant query image, you'll see that what gets retrieved is other elephants, but elephants in very different poses, so those images wouldn't have a very good overlap with the query in pixel space, though the overlap wouldn't be that bad. If you look at the Halloween pumpkins, you'll see that all the retrieved images are other Halloween pumpkins, and some of them would have a pretty bad overlap in pixel space. Similarly with the aircraft carriers: we retrieve other images of aircraft carriers that look very different.

So we anticipate that if we could reduce the activity vector to a short binary code, we would have a fast and effective way of retrieving similar images just by using the content of the image. We'll see in lecture sixteen that we can actually combine the content of the image with its caption to get an even better representation.
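To tie this back to the two-stage search described at the start of the video, here is a minimal sketch of that lookup, assuming each stored image already has a roughly 30-bit address code and a 256-bit code, both held as Python integers; the index layout and function names are mine, not the lecture's.

```python
# Stage 1: semantic hashing on a short (~30-bit) code to get a short list of candidates.
# Stage 2: rank the candidates by Hamming distance on their 256-bit codes.
from itertools import combinations

def hamming(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def neighbours(code, n_bits=30, max_flips=2):
    """The query's own address plus every address within a few bit flips of it."""
    yield code
    for r in range(1, max_flips + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def retrieve(query_short, query_long, index, k=10):
    """index maps a short code to a list of (image_id, long_code) stored at that address."""
    shortlist = []
    for address in neighbours(query_short):
        shortlist.extend(index.get(address, []))
    shortlist.sort(key=lambda item: hamming(query_long, item[1]))
    return shortlist[:k]
```

The point of the two stages is that the hash lookups in stage one touch only a tiny fraction of the database, and the serial comparison in stage two is just an XOR and a bit count per candidate.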