In this video I'm going to talk about the issue of viewpoint invariance. Each time we look at an object in a scene, we typically have a different viewpoint, so the object shows up on different pixels. This makes object recognition very unlike most machine learning tasks, and I'm going to talk about various ways of trying to deal with that issue. A number of different ways have been suggested for coping with viewpoint variation. We're so good at it that we don't really appreciate how difficult it is. It's one of the main difficulties in making computers perceive, and there still aren't generally accepted solutions, either in engineering or in psychology.

The first approach is to use redundant invariant features. The second approach is to put a box around the object so that you can normalize the pixels. The third approach is to use replicated features and pool them; this is called convolutional neural nets, and I'll go into that in great detail. And the fourth approach, which we'll talk about at the end of the lecture, is to use a hierarchy of parts and to explicitly represent the poses of the parts relative to the camera or retina.

So the invariant feature approach says you should extract a large and redundant set of features, and they should be features that are invariant under transformations like translation, rotation and scaling. Here's an example of an invariant feature: a pair of roughly parallel lines with a red dot between them. That's actually been suggested as the feature that baby herring gulls use for knowing where to peck for food. If you paint that feature on a piece of wood, they'll peck at the appropriate place on the piece of wood.

With enough invariant features, there's only one way to assemble them into an object or an image. You don't actually need to represent the relationships between features directly, because those relationships are captured by other features. This has been pointed out for strings of letters by a psychologist called Wayne [UNKNOWN], and it's been pointed out in vision by Shimon Ullman. It's a sort of cute point that all we need is a big bag of features, because with overlapping and redundant features, one feature will tell you how two other features are related. Unfortunately, if you're doing recognition, you're going to get a whole bunch of features that are composed of parts of different objects.
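To make the "big bag of overlapping features" point concrete, here is a minimal Python sketch (my own illustration, not from the lecture) using strings of letters: if a word is represented only by its unordered bag of overlapping letter trigrams, the overlaps between features are usually enough to pin down the original letter order, so the relationships between letters don't have to be stored separately.

```python
from collections import Counter

def bag_of_trigrams(word):
    """Represent a word as an unordered bag of overlapping letter trigrams.

    The '#' padding marks the ends, so boundary letters also appear in
    redundant, overlapping features.
    """
    padded = "#" + word + "#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Each trigram shares two letters with its neighbours, so the bag implicitly
# encodes how the letters are related, even though the bag itself is unordered.
print(bag_of_trigrams("viewpoint"))

# Scrambling the letters changes the bag, so the relationships really are
# captured by the overlapping features themselves.
print(bag_of_trigrams("viewpoint") == bag_of_trigrams("viewopint"))  # False
```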
Features formed from parts of different objects will be very misleading for recognition, so you'd like to avoid forming them.

A second approach is what I call judicious normalization. If you look at that upside-down capital letter R on the right, I've put a box around it (not very well, in fact), and I've labeled a top and a front for that box. Relative to that box, the R has, for example, a vertical stroke at the back and a loop facing forwards at the top. So if we describe features of the R relative to that box, they're going to be invariant. This is assuming it's a rigid shape. Putting a box around a rigid shape solves the dimension-hopping problem: it gets rid of the effect of changes in viewpoint. If we choose the box correctly, the same part of an object will always occur on the same normalized pixels. It doesn't have to be a rectangular box; we can provide invariance not only to translation, rotation and scale but also to things like shear and stretch.

Unfortunately, choosing the box is difficult. It's difficult because we might have segmentation errors, we might have occlusion (so you can't just shrink a box around things), and we might have unusual orientations. That example of the upside-down R makes it clear that we have to use our knowledge of what the shape is to help us decide what the box is. If, for example, we had a character that was like a lowercase d but with an extra stroke coming out of the loop of the d, we would see the upside-down R as an upright one of those characters. So it's a chicken-and-egg problem: in order to get the box right, we need to recognize the shape, and in order to recognize the shape, we need to get the box right.

An aside here for psychologists. Many psychologists think we do mental rotation to deal with shapes that aren't oriented right. This is complete nonsense. That capital letter R you recognize perfectly well before you do any mental rotation. Indeed, you need to recognize that it's an R and that it's upside down in order to know how to rotate it. You use mental rotation for dealing with judgments like handedness, that is, is it a correct R or a mirror-image R? You can't tell that without doing mental rotation. But mental rotation is not used for dealing with the fact that it's upside down when we want to recognize it.
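As a rough illustration of what normalizing the pixels relative to a box means (my sketch, not code from the lecture), the snippet below crops an assumed axis-aligned box from an image array and resamples it onto a fixed grid with nearest-neighbour sampling. Provided the box is chosen correctly, the same part of a rigid shape then lands on the same normalized pixels whatever its position and scale in the original image.

```python
import numpy as np

def normalize_box(image, box, out_size=(32, 32)):
    """Resample the contents of an axis-aligned box onto a fixed grid.

    image: 2-D numpy array of pixel intensities.
    box:   (top, left, height, width) in image coordinates (an assumed format).
    Returns an out_size array in which, given the right box, the same object
    part always occupies the same normalized pixels.
    """
    top, left, height, width = box
    out_h, out_w = out_size
    # Nearest-neighbour sampling: map each output pixel back into the box.
    rows = top + (np.arange(out_h) + 0.5) * height / out_h
    cols = left + (np.arange(out_w) + 0.5) * width / out_w
    rows = np.clip(rows.astype(int), 0, image.shape[0] - 1)
    cols = np.clip(cols.astype(int), 0, image.shape[1] - 1)
    return image[np.ix_(rows, cols)]

# A small vertical stroke; a tight box around it gives the same normalized
# patch wherever and at whatever size the stroke appears.
img = np.zeros((100, 100))
img[20:40, 30:35] = 1.0
patch = normalize_box(img, (20, 30, 20, 5))
```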
The brute force normalization approach works like this. You use well-segmented, upright images that you can judiciously put a box around when you train the recognizer, and then at test time, when you have to deal with cluttered images, you try all possible boxes in a whole range of positions and scales. This approach is widely used in computer vision, particularly for detecting upright things like faces or house numbers in unsegmented images. It's much more efficient if the recognizer can cope with some variation in position and scale, so that we can use a coarse grid when trying all possible boxes.
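A minimal sketch of that brute-force search (my own illustration, not from the lecture): slide candidate boxes over the image on a coarse grid of positions and scales, normalize each one using the normalize_box helper sketched above, and keep the boxes that a trained recognizer scores highly. The recognizer here is a hypothetical score_fn; any classifier trained on well-segmented, upright examples could play that role.

```python
def detect(image, score_fn, scales=(20, 40, 80), stride_frac=0.25, threshold=0.9):
    """Brute-force normalization at test time: try boxes over a coarse grid
    of positions and scales and score each normalized patch.

    score_fn: hypothetical classifier mapping a normalized patch to a score
              in [0, 1]; it must tolerate the small position and scale jitter
              left over by the coarse grid.
    Returns a list of (score, box) pairs above the threshold.
    """
    detections = []
    for size in scales:                              # coarse grid over scales
        stride = max(1, int(size * stride_frac))     # coarse grid over positions
        for top in range(0, image.shape[0] - size + 1, stride):
            for left in range(0, image.shape[1] - size + 1, stride):
                box = (top, left, size, size)
                score = score_fn(normalize_box(image, box))
                if score >= threshold:
                    detections.append((score, box))
    return detections
```

The coarser the grid (larger stride_frac and fewer scales), the cheaper the search, but the more position and scale variation the recognizer itself has to absorb.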