In this video, I'm going to talk about the storage capacity of Hopfield nets. Their ability to store a lot of memories is limited by what are called spurious memories. These occur when two nearby energy minima combine to make a new minimum in the wrong place. Attempts to remove these spurious minima eventually led to a very interesting way of doing learning in things considerably more complicated than a basic Hopfield net. At the end of the video, I'll also talk about a curious historical rediscovery, in which physicists trying to increase the capacity of Hopfield nets rediscovered the perceptron convergence procedure.

So, Hopfield invented Hopfield nets as memory storage devices, and the field became obsessed by the storage capacity of a Hopfield net. Using Hopfield's storage rule for a fully connected net, the capacity is about 0.15N memories. That is, if you have N binary threshold units, the number of memories you can store is about 0.15N before memories start getting confused with one another. So that's the number you can store and still hope to retrieve them sensibly.

Each memory is a random configuration of the N units, so it has N bits of information in it. And so the total information being stored in a Hopfield net is about 0.15N² bits.

This doesn't make efficient use of the bits that are required to store the weights. In other words, if you look at how many bits the computer is using to store the weights, it's using well over 0.15N² bits. And therefore, this kind of distributed memory in local energy minima is not making efficient use of the bits in the computer.

We can analyze how many bits we should be able to store if we were making efficient use of the bits in the computer, that is, of those N² weights and biases in the net. After storing M memories, each connection weight has an integer value in the range −M to M. That's because we increase it by one or decrease it by one each time we store a memory, assuming that we use states of −1 and +1. Now, of course, not all values will be equiprobable, so we could compress that information. But ignoring that, the number of bits it would take us to store a connection weight in a naive way is log₂(2M+1), because that's the number of alternative connection weight values, and we take the log to the base two.
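To make the storage rule and this bit-counting concrete, here's a minimal sketch in Python (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def hopfield_store(patterns):
    """Hopfield's storage rule: add the outer product of each +/-1 pattern
    to the weights. Weights stay integer-valued and symmetric, with no
    self-connections."""
    n = patterns.shape[1]
    W = np.zeros((n, n), dtype=int)
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W

n, m = 100, 10                                 # n units, m stored memories
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(m, n))    # m random binary configurations
W = hopfield_store(patterns)

# After storing m memories, each weight is an integer in [-m, m], i.e. one
# of 2m + 1 possible values, so a naive code needs log2(2m + 1) bits each.
assert W.min() >= -m and W.max() <= m
print(np.log2(2 * m + 1))                      # bits per weight, uncompressed
```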
And so, the total number of bits of computer memory that we use is of the order of N² log₂(2M+1). So notice that that scales logarithmically with M, whereas if you store things in the way that Hopfield suggests, you get this constant, 0.15, instead of something that scales logarithmically. So we're not so worried about the fact that the constant is a lot less than two; what we're worried about is this logarithmic scaling. That shows we ought to be able to do something better.

If we ask what limits the capacity of a Hopfield net, what is it that causes it to break down, then it's the merging of energy minima. Each time we memorize a binary configuration, we hope that we'll create a new energy minimum. So we might have a state space, with all the states of the net being depicted horizontally here and the energy being depicted vertically. And we might have one energy minimum for the blue pattern and another for the green pattern. But if those two patterns are nearby, what will happen is we won't get two separate minima; they'll merge to create one minimum at an intermediate location. And that means we can't distinguish those two separate memories, and indeed we'll recall something that's a blend of them rather than the individual memories. That's what limits the capacity of a Hopfield net: that kind of merging of nearby minima.

One thing I should mention is that this picture is a big misrepresentation. The states of a Hopfield net are really the corners of a hypercube, and it's not very good to show the corners of a hypercube as if they were a continuous one-dimensional horizontal space.

One very interesting idea that came out of thinking about how to improve the capacity of the Hopfield net is the idea of unlearning. This was first suggested by Hopfield, Feinstein and Palmer, who proposed the following strategy: you let the net settle from a random initial state, and then you do unlearning. That is, whatever binary state it settles to, you apply the opposite of the storage rule. I think you can see, with the previous example, that if you let the net settle to that red merged minimum and did some unlearning there, you'd get back the two separate minima, because you'd pull up that red point. So, by getting rid of deep spurious minima, we can actually increase the memory capacity.
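Continuing the earlier sketch, the unlearning step might look like this (the number of rounds and the size of the unlearning step, eps, are arbitrary choices of mine; how much unlearning to do is exactly the question raised later in the video):

```python
import numpy as np

def settle(W, state, n_sweeps=20, rng=None):
    """Let the net settle: sweep over the units in random order, setting each
    binary threshold unit to the state favoured by its total input."""
    rng = rng or np.random.default_rng()
    state = state.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

def unlearn(W, n_rounds=5, eps=0.1, rng=None):
    """Hopfield, Feinstein and Palmer's unlearning: settle from a random
    initial state, then apply the opposite of the storage rule to whatever
    the net settled to, pulling up spurious minima."""
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    W = W.astype(float)
    for _ in range(n_rounds):
        s = settle(W, rng.choice([-1, 1], size=n), rng=rng)
        W -= eps * np.outer(s, s)              # subtract, rather than add
        np.fill_diagonal(W, 0)
    return W
```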
Hopfield, Feinstein and Palmer showed that this actually worked, but they didn't have a good analysis of what was really going on.

Francis Crick, one of the discoverers of the structure of DNA, and Graeme Mitchison proposed that unlearning might be what's going on during REM sleep, that is, rapid eye movement sleep. So the idea was that during the day you store lots of things, and you get spurious minima. Then at night, you put the network in a random state, you settle to a minimum, and you unlearn what you settled to. And that actually explains a big puzzle. This is a puzzle that doesn't seem to puzzle most people who study sleep, but it ought to. Each night, you go to sleep and you dream for several hours. When you wake up in the morning, those dreams are all gone. Well, they're not quite all gone. The dream you had just before you woke up, you can get into short-term memory and you'll remember it for a while; and if you think about it, you might remember it for a long time. But we know perfectly well that if we'd woken you up at other times in the night, you'd have been having other dreams, and in the morning they're just not there. So it looks like you're simply not storing what you're dreaming about, and the question is, why? In fact, why do you bother to dream at all? Dreaming is paradoxical in that the state of your brain looks extremely like the state of your brain when you're awake, except that it's not being driven by real input. It's being driven by a relay station just after the real input, called the thalamus. So the Crick and Mitchison theory at least explains, functionally, what the point of dreams is: to get rid of the spurious minima.

But there's another problem with unlearning, which is a more mathematical problem, which is: how much unlearning should we do? Now, given what you've seen in the course so far, a real solution to that problem would be to show that unlearning is part of the process of fitting a model to data. And if you do maximum likelihood fitting of that model, then unlearning will automatically come out of fitting the model. And what's more, you'll know exactly how much unlearning to do. So what we're going to try and do is derive unlearning as the right way to minimize a cost function, where the cost function is how well your neural net models the data that you saw during the day.
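Looking ahead, here is a hedged sketch of how that works out. Assuming the model is the kind of energy-based model this course builds toward (a Boltzmann machine), the maximum-likelihood gradient for each weight is a Hebbian term measured on the data minus an identical-looking term measured on states the model generates by itself, and that second term is the unlearning, with its size fixed by the math:

```python
import numpy as np

def max_likelihood_step(W, data_states, model_samples, lr=0.01):
    """One maximum-likelihood gradient step (a sketch, assuming a
    Boltzmann-machine-style model): Hebbian learning on the states you saw
    during the day, minus an equal-sized anti-Hebbian 'unlearning' term on
    states the model settles to on its own at night."""
    positive = np.mean([np.outer(s, s) for s in data_states], axis=0)
    negative = np.mean([np.outer(s, s) for s in model_samples], axis=0)
    W = W + lr * (positive - negative)    # unlearning comes out automatically
    np.fill_diagonal(W, 0)
    return W
```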
Before we get to that, I want to talk a little bit about ways that physicists discovered for increasing the capacity of the Hopfield net. As I said, this was a big obsession with the field. I think it's because physicists really love the idea that math they already know might explain how the brain works. That means postdoctoral fellows in physics who can't get a job in physics might be able to get a job in neuroscience. So there were a very large number of papers published in physics journals about Hopfield nets and their storage capacity. Eventually, a very smart student called Elizabeth Gardner figured out that there's actually a much better storage rule if you're concerned about capacity, one that would use the full capacity of the weights. And I think this storage rule will be familiar to you.

Instead of trying to store vectors in one go, what we're going to do is cycle through the training set many times. So we lose our nice online property that you only have to go through the data once, but in return we gain more efficient storage. What we're going to do is use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in the global vector that we want to store. So you take your net, you put it into the memory state you want to store, and then you take each unit separately and say: would this unit adopt the state I want for it, given the states of all the other units? If it would, you leave its incoming weights alone. If it wouldn't, you change its incoming weights in the way specified by the perceptron convergence procedure. And notice, these would be integer changes to the weights. You may have to do this several times, and of course, if you give it too many memories, this won't converge. You only get convergence with the perceptron convergence procedure if there is a set of weights that will solve the problem. But assuming there is, this is a much more efficient way to store memories in a Hopfield net.

This technique was also developed in another field: statistics. Statisticians call the technique pseudo-likelihood. The idea is to get one thing right given all the other things.
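Before going on, here's a minimal sketch of that storage procedure (the function name is mine; this is the plain per-unit version, ignoring for the moment that Hopfield weights are symmetric, which is dealt with next):

```python
import numpy as np

def perceptron_store(patterns, n_sweeps=100):
    """Gardner-style storage, sketched: cycle through the memories many
    times; for each unit, check whether it would adopt its target state
    given all the other units, and if not, apply the perceptron convergence
    update (an integer change) to its incoming weights."""
    n = patterns.shape[1]
    W = np.zeros((n, n), dtype=int)
    for _ in range(n_sweeps):
        stable = True
        for p in patterns:
            for i in range(n):
                if p[i] * (W[i] @ p) <= 0:     # unit i would get it wrong
                    W[i] += p[i] * p           # perceptron update, integer steps
                    W[i, i] = 0                # keep no self-connection
                    stable = False
        if stable:                             # every memory is now stable
            return W
    return W                                   # may not converge: too many memories
```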
So, coming back to pseudo-likelihood: with high-dimensional data, if you want to build a model of it, the idea is you build a model that tries to get the value on one dimension right given the values on all the other dimensions. The main difference between the perceptron convergence procedure as it's normally used and pseudo-likelihood is that, in the Hopfield net, the weights are symmetric, so we have to get two sets of gradients for each weight and average them. But apart from that, the way to use the full capacity of a Hopfield net is to use the perceptron convergence procedure and go through the data several times.
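In code, the symmetric variant might look like this, continuing the sketch above. Each weight w_ij receives a gradient from unit i's prediction and another from unit j's, so each unit's update is applied half to its row and half to its column, which averages the two and keeps the weights symmetric:

```python
import numpy as np

def perceptron_store_symmetric(patterns, n_sweeps=100):
    """Same sketch as before, but respecting the Hopfield net's symmetric
    weights: half of each unit's perceptron update goes to its incoming
    weights and half to the mirror-image outgoing weights."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for _ in range(n_sweeps):
        stable = True
        for p in patterns:
            for i in range(n):
                if p[i] * (W[i] @ p) <= 0:     # unit i in the wrong state
                    delta = (p[i] * p).astype(float)
                    delta[i] = 0.0             # no self-connection
                    W[i] += 0.5 * delta        # half to row i
                    W[:, i] += 0.5 * delta     # half to column i (its mirror)
                    stable = False
        if stable:
            return W
    return W
```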