In this video, I'm going to describe some of the things that make it difficult to recognize objects in real scenes. We're incredibly good at this, so it's very hard for us to realize how difficult it is to take a bunch of numbers describing the intensities of pixels and go from there to the label of an object.

There are all sorts of difficulties. We have to segment the object out. We have to deal with variations in lighting and in viewpoint. We have to deal with the fact that the definitions of objects are quite complicated. It's also possible that getting from an image to an object requires huge amounts of knowledge, even for the lower-level processes that involve segmentation and dealing with viewpoint and lighting. If that's the case, it's going to be very hard for any hand-engineered program to do a good job of those things.

There are many reasons why it's hard to recognize objects in images. First of all, it's hard to segment an object out from the other things in an image. In the real world we move around, so we have motion cues, and we have two eyes, so we have stereo cues. You don't get those in static images, so it's very hard to tell which pieces go together as parts of the same object. Also, parts of an object can be hidden behind other objects, so you often don't see the whole of an object. You're so good at vision that you rarely notice this.

Another thing that makes it very hard to recognize objects is that the intensity of a pixel is determined as much by the lighting as by the nature of the object. So, for example, a black surface in bright light will give you much more intense pixels than a white surface in very gloomy light. Remember, to recognize an object you've got to convert a bunch of numbers, that is, the intensities of the pixels, into a class label, and these intensities vary for all sorts of reasons that have nothing to do with the nature of the object, or rather with the identity of the object. The first sketch below puts rough numbers on this.

Objects can also deform in a variety of ways. Even for relatively simple things like handwritten digits, there's a wide variety of different shapes that have the same name. A two, for example, could be very italic with just a cusp instead of a loop, or it could be a very loopy, very round two with a big loop.

It's also the case that for many types of object, the class is defined more by what the object is for than by its visual appearance. Consider chairs: there's a huge variety of things we call chairs, from armchairs to modern chairs made with steel frames and wood backs to basically anything you can sit on.

One other thing that makes it hard to recognize objects is that we see them from different viewpoints. There's a wide variety of viewpoints from which we can recognize a 3D object, and changes in viewpoint cause changes in the images that standard machine learning methods cannot cope with. The problem is that information hops about between the input dimensions. Typically, in vision, the input dimensions correspond to pixels, and if an object moves in the world and you don't move your eyes to follow it, the information about the object will land on different pixels. That's not the kind of thing we normally have to deal with in machine learning; the second sketch below shows it happening with a toy image.
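To put rough numbers on the lighting point: if pixel intensity behaves roughly like reflectance times illumination, a dark surface under strong light can out-shine a bright surface under weak light. The reflectance and illumination values in this little Python sketch are invented purely for illustration.

```python
# Pixel intensity is roughly reflectance * illumination, so a black surface
# in bright light can send more light to the camera than a white surface
# in gloomy light. (These numbers are made up for illustration.)
black_reflectance, white_reflectance = 0.05, 0.90
bright_light, gloomy_light = 1000.0, 10.0

print(black_reflectance * bright_light)  # 50.0 -- black surface, bright light
print(white_reflectance * gloomy_light)  # 9.0  -- white surface, gloomy light
```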
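And here is a minimal sketch of that dimension-hopping effect, assuming we flatten a tiny image into the kind of input vector a standard learner sees. The image size, the object, and the amount of shift are all made up for the example.

```python
import numpy as np

# A tiny 5x5 "image" with a bright 2x2 object in the top-left corner.
image = np.zeros((5, 5))
image[0:2, 0:2] = 1.0

# The same object, shifted two pixels to the right (a viewpoint change).
shifted = np.zeros((5, 5))
shifted[0:2, 2:4] = 1.0

# A standard learner sees each image as a flat vector of input dimensions.
x1 = image.flatten()
x2 = shifted.flatten()

# The object is identical, but the active input dimensions are disjoint:
# the information has hopped to different dimensions.
print(np.nonzero(x1)[0])  # [0 1 5 6]
print(np.nonzero(x2)[0])  # [2 3 7 8]
print(np.dot(x1, x2))     # 0.0 -- no overlap at all between the two vectors
```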
Just to stress that point, suppose we had a medical database in which one of the inputs is the age of a patient and another input is the weight of the patient. We start doing machine learning, and then we realize that some coder has changed which input dimension codes which property: for some of the records they've put weight where they should have put age, and age where they should have put weight. Obviously we wouldn't just carry on with our learning, because that would make everything go wrong; we'd try to do something to fix it. I call that phenomenon dimension hopping: information jumping from one input dimension to another. That's exactly what viewpoint does, and it's something we need to fix, preferably in a systematic way.
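As a hedged illustration of that point, here's a sketch of the age/weight mix-up corrupting a learner. The linear risk rule, the made-up patients, the fraction of swapped records, and the use of plain least squares as the "learning" are all invented for the example; the point is just that the same procedure no longer recovers the right coefficients once information has hopped between dimensions.

```python
import numpy as np

# Invented rule: risk depends on age and weight in different ways.
def true_risk(age, weight):
    return 0.08 * age + 0.01 * weight

rng = np.random.default_rng(0)
ages = rng.uniform(20, 80, size=1000)
weights = rng.uniform(50, 120, size=1000)
X = np.column_stack([ages, weights])
y = true_risk(ages, weights)

# Fit a linear model on clean data: it recovers the true coefficients.
coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y,
                           rcond=None)
print(coef[:2])  # approximately [0.08, 0.01]

# Now a coder swaps the age and weight columns for some of the records.
X_swapped = X.copy()
X_swapped[:300, [0, 1]] = X_swapped[:300, [1, 0]]

# Carrying on learning blindly smears the two coefficients together:
# information about age now sometimes lives in the weight dimension.
coef_bad, *_ = np.linalg.lstsq(
    np.column_stack([X_swapped, np.ones(len(X_swapped))]), y, rcond=None)
print(coef_bad[:2])  # no longer close to [0.08, 0.01]
```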