In this video, I'm going to explain a very different way of using Hopfield's energy function. We add some hidden units to the network, and what we are trying to do is make the states of those hidden units represent an interpretation of the perceptual input that's shown on the visible units. So the idea is that the weights between units represent constraints on good interpretations, and by finding a low-energy state, we find a good interpretation of the input data.

Hopfield nets combine two ideas: the idea that you can find a local energy minimum by using a network of symmetrically connected binary threshold units, and the idea that these local energy minima might correspond to memories.

There's a different way of using the ability to find local minima. Instead of using the net to store memories, we can use it to construct interpretations of sensory input. So the idea is that we have the input represented by some visible units, and we construct an interpretation of that input in a set of hidden units. The interpretation, or explanation, of the input is going to be a binary configuration over the hidden units. The energy of the whole system will represent the badness of that interpretation. So, to get good interpretations according to our current model of the world, which is encoded in the energy function, we need to find low-energy states of the hidden units given the input represented by the visible units.

I want to give an example of this to make the idea clearer. In order to give the example, I need to go into a little bit of detail about what you can infer when you see a 2-D line in an image: what does that tell you about the three-dimensional world?

A 2-D line in an image could have been caused by many different three-dimensional edges in the world. If this blue dot is your eyeball and the red lines are two lines of sight coming from the center of your eyeball, then the black line is a possible 3-D edge that would lead to a two-dimensional line on your retina. Here's another 3-D edge that would lead to exactly the same thing on your retina. And here's another one, and another one. All of these different 3-D edges have exactly the same appearance in the image. That's because we've lost the information about how far away the ends of the line are along the lines of sight. We know each end is somewhere along its line of sight, but we don't know the depth.
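To make the machinery being reused here concrete, here is a minimal sketch in Python of the Hopfield energy function and a deterministic downhill settling procedure with the visible units clamped to the data. The function names, the 0/1 state convention, and the update schedule are my assumptions for illustration, not details given in the lecture.

```python
import numpy as np

def energy(s, W, b):
    """Hopfield energy: E = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j
    (W symmetric with zero diagonal, states s_i in {0, 1})."""
    return -b @ s - 0.5 * s @ W @ s

def settle(s, W, b, clamped, n_sweeps=20, rng=np.random.default_rng(0)):
    """Repeatedly apply the binary threshold rule to the unclamped (hidden)
    units in random order; each update can only lower or preserve the energy."""
    s = s.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(s)):
            if clamped[i]:
                continue                 # visible units stay fixed to the data
            gap = b[i] + W[i] @ s        # energy(s_i=0) - energy(s_i=1)
            s[i] = 1 if gap > 0 else 0   # turn on only if that lowers the energy
    return s
```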
So, if we assume that a straight 3-D edge in the world is the cause of a straight 2-D line in the image, then we've lost two degrees of freedom of that 3-D edge: its depth at each end. So there is a whole family of 3-D edges that all correspond to the same 2-D line. You can only see one of these 3-D edges at a time, because they all get in the way of each other.

So now we're in a position to see a little example of what you might be able to do if you can use the fact that you can find low-energy states of a network of binary units to help you find interpretations of sensory input.

Here's the example. Imagine we see a line drawing, and we want to interpret it as a three-dimensional thing. The data we have, let's suppose, is a bunch of 2-D lines like the lines shown in the picture. For each possible line, we will have set aside a neuron. Don't worry for now about the fact that that would require too many neurons. So, for every possible 2-D line, we have a neuron. In any one picture, only a few of the possible lines will be present, and so we'll activate just a few of those neurons. I've shown two edges in that picture activating two of the neurons. Those are the neurons that represent 2-D lines; they're the data.

Now, what we're going to do is have a whole bunch of 3-D line units, one for each possible 3-D line or 3-D edge. Each of the 2-D line units could be the projection of many different possible 3-D lines. We therefore need to make the 2-D line unit excite all those 3-D line units, but we also need to make them all compete with one another, because you can only see one of them at a time. So here's an example where I have a stack of 3-D line units. The green connections are excitatory connections coming from the 2-D line unit, all of them with equal weight, saying: if this 2-D line unit is present, I'm going to try to turn on all those 3-D line units. But in addition, we need competition between them so that only one of them will turn on, and that's what the red lines represent. We do that for each 2-D line unit; I'm just showing it for the two 2-D line units that happen to be active at present. And again, don't worry about the fact that this would need far too many units.

Now, the story is not quite complete.
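To show what this wiring could look like, here is a small illustrative sketch of building the green excitatory and red inhibitory connections into a symmetric weight matrix. The unit layout, the particular weight values, and the function name are my own assumptions for a toy version of the network, not something specified in the lecture.

```python
import numpy as np

def wire_projection_constraints(n_2d, candidates_per_2d, excite=1.0, inhibit=-2.0):
    """Toy weight matrix: units 0..n_2d-1 are 2-D line units (the data), followed
    by one block of candidate 3-D edge units for each 2-D line unit.
    Each 2-D unit excites all of its candidates (green connections), and the
    candidates within a block inhibit each other (red connections), so that at
    most one of them wants to be on at a time."""
    n = n_2d + n_2d * candidates_per_2d
    W = np.zeros((n, n))
    for i in range(n_2d):
        block = range(n_2d + i * candidates_per_2d,
                      n_2d + (i + 1) * candidates_per_2d)
        for j in block:
            W[i, j] = W[j, i] = excite   # 2-D line unit excites each 3-D candidate
            for k in block:
                if k != j:
                    W[j, k] = inhibit    # rival 3-D candidates compete
    return W
```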
We've now wired into the neural network the information about projection that I showed on the previous slide. That is, the neural network, in those green and red connections, understands that each 2-D line can correspond to many 3-D edges, but only one of them should be present at a time.

But now, we know a lot about how 3-D edges connect. For example, when we see two 2-D lines join in the image, we think it's almost certain that they correspond to edges that have the same depth at the point where the lines join. So let's suppose that the two 3-D edges I've joined there correspond to having the same depth at the point where the two 2-D lines join. That means they should support each other. It doesn't have to be like that: you could have a very funny viewpoint where one edge ends at a different depth from the other, and you just happen to be at the viewpoint from which they coincide on your retina. But that's very unlikely. So we're going to use the fact that we expect 2-D lines that coincide in the image to correspond to 3-D edges that agree on the depth at that point, and we'll put in a lot of connections like that.

But there's an even stronger fact we can use, which is that in our manufactured world, we expect that quite often, 3-D edges will join at a right angle. So, for two particular 3-D edges that happen to agree in depth and join at a right angle, we'll put in a particularly strong connection, and I've indicated that by a thicker green line.

By putting in lots of connections like that, we can indicate how we expect 3-D edges to go together to form a coherent 3-D object. And now we have a network that contains information about how edges go together in the world and about how edges project to cause lines in the image. So if we give that network an image, it should be able to come up with an interpretation of the image. For the image I'm showing you, there are two quite different interpretations. It's called a Necker cube, and if you look at it long enough, it will flip in depth on you. This network would have two pretty much equally deep energy minima that correspond to those two interpretations of the Necker cube.

Remember, this is all just an analogy, so that you understand the idea of using low-energy states as interpretations of perceptual data.
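Continuing the toy sketch from above (again, my own illustration rather than anything specified in the lecture), the depth-agreement and right-angle constraints could be added as positive weights between the corresponding 3-D edge units, with a larger value for pairs that also meet at a right angle:

```python
def add_edge_support(W, agreeing_pairs, right_angle_pairs,
                     support=1.0, strong_support=3.0):
    """agreeing_pairs:    pairs of 3-D edge unit indices that agree on depth
                          where their 2-D lines join (ordinary green connections).
    right_angle_pairs:    the subset of those pairs that also join at a right
                          angle; these get the especially strong (thicker) weight."""
    for j, k in agreeing_pairs:
        W[j, k] = W[k, j] = support
    for j, k in right_angle_pairs:
        W[j, k] = W[k, j] = strong_support
    return W
```

With the 2-D line units for the drawing clamped on, settling the hidden units from different random starting states should then land in one or the other of the two deep energy minima, i.e. the two interpretations of the Necker cube.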
To actually build a proper model of what happens when the Necker cube flips would be a lot more complicated than this.

So, if we decide we're going to use low-energy states to represent good interpretations, then we have two issues. The first is to do with search, and I'm going to deal with that in the next video. The search question is: how do we avoid the hidden units getting trapped in poor local minima of the energy function? Poor minima represent interpretations that are sub-optimal, given our current model and the weights of the network. Can we do anything better than simply going downhill in energy from some random starting state?

The second issue, which seems even more difficult, is: how do we learn the weights on the connections between hidden units, and between visible units and hidden units? Is there some simple learning algorithm for adjusting all those weights so that we get sensible perceptual interpretations? And notice that here we haven't got a supervisor anywhere. We're just showing it input, and we would like it to construct patterns of activity in the hidden units that represent sensible interpretations. That seems like a rather tall order.
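As a crude baseline for the search issue, before the stochastic methods of the next video, one could simply settle from many random hidden configurations and keep the lowest-energy state found. The sketch below reuses the energy and settle functions from the earlier snippet and is, again, my own illustration rather than the lecture's proposal.

```python
def best_of_random_restarts(v, W, b, n_hidden, n_restarts=50,
                            rng=np.random.default_rng(1)):
    """Clamp the visible units to the data v, settle from n_restarts random
    hidden configurations, and return the lowest-energy state found."""
    clamped = np.array([True] * len(v) + [False] * n_hidden)
    best_s, best_e = None, np.inf
    for _ in range(n_restarts):
        s = np.concatenate([np.asarray(v), rng.integers(0, 2, n_hidden)])
        s = settle(s, W, b, clamped, rng=rng)
        e = energy(s, W, b)
        if e < best_e:
            best_s, best_e = s, e
    return best_s, best_e
```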