1
00:00:00,000 --> 00:00:06,311
In this video, I'm going to explain how
adding noise can help systems escape from

2
00:00:06,311 --> 00:00:10,362
local minima.
And, I'm going to show what you have to do

3
00:00:10,362 --> 00:00:15,427
to the units in a Hopfield net to add noise
in the appropriate way.

4
00:00:15,427 --> 00:00:21,816
I'm now going to introduce the idea that
we can find better minima by using noise.

5
00:00:21,816 --> 00:00:27,971
So, a Hopfield net always makes decisions
that reduce the energy, or if it doesn't

6
00:00:27,971 --> 00:00:31,400
change the state of the unit, the energy stays the
same.

7
00:00:31,400 --> 00:00:35,199
This makes it impossible to climb out of a
local minimum.

8
00:00:35,199 --> 00:00:39,998
So, if you look at the landscape here,
if we get into the local minimum A,

9
00:00:39,998 --> 00:00:45,064
there's no way we're going to get over the
energy barrier to get to the better

10
00:00:45,064 --> 00:00:48,197
minimum B because we can't go uphill in
energy.

11
00:00:48,197 --> 00:00:53,530
If we add random noise, we can escape from
poor minima, especially minima that are

12
00:00:53,530 --> 00:00:58,130
shallow, that is, ones that don't have big
energy barriers around them.

13
00:00:58,130 --> 00:01:03,342
It turns out, rather than using a fixed
noise level, the most effective strategy

14
00:01:03,342 --> 00:01:08,290
is to start with a lot of noise which
allows you to explore the space on a

15
00:01:08,290 --> 00:01:13,767
coarse scale and find the generally good
regions of the space, and then to decrease

16
00:01:13,767 --> 00:01:17,329
the noise level.
With a lot of noise, you can cross big

17
00:01:17,329 --> 00:01:20,694
barriers.
As you decrease the noise level, you start

18
00:01:20,694 --> 00:01:26,702
concentrating on the best nearby minima.
If you slowly reduce the noise, so the

19
00:01:26,702 --> 00:01:31,680
system ends up in a deep minimum, that's
called simulated annealing.

20
00:01:31,680 --> 00:01:37,178
And this idea was propagated by
Kirkpatrick at around the same time as

21
00:01:37,178 --> 00:01:42,375
Hopfield nets were proposed.
So, the reason simulated annealing works is

22
00:01:42,375 --> 00:01:48,174
because the temperature, in a physical
system, or in a simulated system with an

23
00:01:48,174 --> 00:01:52,311
energy function,
affects the transition probabilities.

24
00:01:52,311 --> 00:01:58,571
So, in a high temperature system, the
probability of going uphill from B to A is

25
00:01:58,571 --> 00:02:03,088
lower than the probability of going
downhill from A to B.

26
00:02:03,088 --> 00:02:07,362
But it's not much lower.
In effect, the temperature flattens the

27
00:02:07,362 --> 00:02:11,741
energy landscape, and so the little black
dots are meant to be particles.

28
00:02:11,741 --> 00:02:16,606
And what we are imagining is particles
moving about according to the transition

29
00:02:16,606 --> 00:02:20,803
probabilities that you get with an energy
function and a temperature.

30
00:02:20,803 --> 00:02:25,304
And this might be a typical distribution
if you run the system at a high

31
00:02:25,304 --> 00:02:30,170
temperature where it's easier to cross
barriers, but it's also hard to stay in a

32
00:02:30,170 --> 00:02:34,610
deep minimum once you've got there.
If you run the system at a much lower

33
00:02:34,610 --> 00:02:38,876
temperature,
then your probability of crossing barriers

34
00:02:38,876 --> 00:02:42,640
gets much smaller but your ratio gets much
better.

35
00:02:42,900 --> 00:02:48,430
So, the ratio of the probability of going
from A to B versus the probability of

36
00:02:48,430 --> 00:02:52,840
going from B to A is much better in the
low temperature system.

37
00:02:53,100 --> 00:02:58,879
And so, if we run it long enough, we would
expect all of the particles to end up in

38
00:02:58,879 --> 00:03:01,922
B.
But if we just run it for a long time at

39
00:03:01,922 --> 00:03:06,322
low temperature, it will take a very long
time for particles to escape from A.

40
00:03:06,322 --> 00:03:11,065
And it turns out a good compromise is to
start at a high temperature and gradually

41
00:03:11,065 --> 00:03:16,201
reduce the temperature.
The way we get noise into a Hopfield net is

42
00:03:16,201 --> 00:03:22,366
to replace the binary threshold units by
binary stochastic units that make biased

43
00:03:22,366 --> 00:03:27,184
random decisions.
And the amount of noise is controlled by

44
00:03:27,184 --> 00:03:31,827
something called temperature,
which you'll see in a minute in the

45
00:03:31,827 --> 00:03:34,948
equation.
Raising the noise level is equivalent to

46
00:03:34,948 --> 00:03:38,320
decreasing all the energy gaps between
configurations.

47
00:03:39,500 --> 00:03:47,290
So, this is our normal logistic equation,
but with the energy gap scaled by a

48
00:03:47,290 --> 00:03:53,514
temperature.
If the temperature is very high, that

49
00:03:53,514 --> 00:03:58,969
exponent will be roughly zero, so the
right hand side will be one over one plus

50
00:03:58,969 --> 00:04:03,871
one. And so, the probability of the unit
turning on will be about a half.

51
00:04:03,871 --> 00:04:07,945
It'll be in its on and off states more
or less equally often.

52
00:04:09,080 --> 00:04:15,645
As we lower the temperature,
depending on the sign of delta E, the unit

53
00:04:15,645 --> 00:04:21,757
will become either more and more firmly on
or more and more firmly off.

54
00:04:21,757 --> 00:04:27,530
At zero temperature, which is what we
use in a Hopfield net,

55
00:04:27,530 --> 00:04:34,066
the sign of delta E determines
whether the right hand side goes to zero

56
00:04:34,066 --> 00:04:38,479
or goes to one.
But, with T zero, it will either be zero

57
00:04:38,479 --> 00:04:42,344
or one on the right hand side.
And so, the unit will behave

58
00:04:42,344 --> 00:04:45,877
deterministically and that's a binary
threshold unit.

59
00:04:45,877 --> 00:04:50,476
It will always adopt whichever of the two
states has the lower energy.
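
To make the effect of temperature concrete, here is a minimal Python sketch of a binary stochastic unit, assuming the logistic form p(on) = 1 / (1 + exp(-delta_E / T)) described above. The function name p_on and the choice of one half for a tie at T = 0 are illustrative assumptions, not something taken from the slides.

```python
import math


def p_on(delta_e: float, temperature: float) -> float:
    """Probability that a binary stochastic unit turns on.

    delta_e is the energy gap E(unit off) - E(unit on).
    temperature must be >= 0.
    """
    if temperature == 0.0:
        # Zero-temperature limit: the unit is a deterministic
        # binary threshold unit driven by the sign of delta_e.
        if delta_e > 0:
            return 1.0
        if delta_e < 0:
            return 0.0
        return 0.5  # assumption: a tie is broken by a fair coin
    x = delta_e / temperature
    # Numerically stable logistic: never exponentiate a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


if __name__ == "__main__":
    for t in (1e6, 1.0, 0.01, 0.0):
        print(t, p_on(2.0, t), p_on(-2.0, t))
```

At a very high temperature both calls give roughly a half; as the temperature drops, the same energy gap pushes the unit more and more firmly on or off.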

60
00:04:50,476 --> 00:04:55,874
So, we saw the energy gap on a previous
slide, and it's just the difference in the

61
00:04:55,874 --> 00:05:01,140
energy of the whole system depending on
whether unit i is off or unit i is

62
00:05:01,140 --> 00:05:04,612
on.
Although simulated annealing is a very

63
00:05:04,612 --> 00:05:09,909
powerful method for improving searches
that get stuck in local optima, and

64
00:05:09,909 --> 00:05:15,708
although it was influential in leading
Terry Sejnowski and me to the ideas behind

65
00:05:15,708 --> 00:05:21,435
Boltzmann machines, it's actually a big
distraction from understanding Boltzmann

66
00:05:21,435 --> 00:05:25,585
machines.
So, I'm not going to talk about it anymore

67
00:05:25,585 --> 00:05:29,139
in this course even though it's a very
interesting idea.

68
00:05:29,139 --> 00:05:33,835
And, from now on, I'm going to use binary
stochastic units that have a temperature

69
00:05:33,835 --> 00:05:37,043
of one.
That is, it's the standard logistic

70
00:05:37,043 --> 00:05:41,885
function of the energy gap.
So, one concept that you need to

71
00:05:41,885 --> 00:05:47,526
understand in order to understand the
learning procedure for Boltzmann machines,

72
00:05:47,526 --> 00:05:53,052
is the concept of thermal equilibrium.
And because we're setting the temperature

73
00:05:53,052 --> 00:05:57,120
to one, this is the concept of thermal
equilibrium at a fixed temperature.

74
00:05:57,120 --> 00:06:01,845
It's a difficult concept. Most people
think that it means the system is settled

75
00:06:01,845 --> 00:06:06,631
down and isn't changing anymore. That's
normally what equilibrium means. But it's

76
00:06:06,631 --> 00:06:10,280
not the states of the individual units
that are settled down.

77
00:06:10,920 --> 00:06:16,411
The individual units are still rattling
around at thermal equilibrium, unless

78
00:06:16,411 --> 00:06:22,111
the temperature is zero. The thing that settles
down is the probability distribution over

79
00:06:22,111 --> 00:06:27,672
configurations. That's a difficult concept
the first time you meet it, and so I'm

80
00:06:27,672 --> 00:06:32,520
going to give you an example.
The probability distribution settles to a

81
00:06:32,520 --> 00:06:36,145
particular distribution called the
stationary distribution.

82
00:06:36,145 --> 00:06:41,000
The stationary distribution is determined
by the energy function of the system.

83
00:06:41,260 --> 00:06:45,550
And, in fact, in the stationary
distribution, the probability of any

84
00:06:45,550 --> 00:06:49,580
configuration is proportional to e to
the minus its energy.
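
As a small worked example of that statement, the sketch below enumerates every configuration of a tiny net and normalizes e to the minus its energy, at a temperature of one. It assumes the usual Hopfield energy with 0/1 states, E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij; the function names and the two-unit weights in the usage lines are hypothetical values chosen only for illustration.

```python
import itertools
import math


def energy(state, weights, biases):
    """Hopfield energy of one configuration (assumed standard form)."""
    n = len(state)
    e = -sum(state[i] * biases[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            e -= state[i] * state[j] * weights[i][j]
    return e


def stationary_distribution(weights, biases):
    """Probability of each configuration, proportional to exp(-energy)."""
    n = len(biases)
    states = list(itertools.product([0, 1], repeat=n))
    unnormalized = [math.exp(-energy(s, weights, biases)) for s in states]
    z = sum(unnormalized)  # normalizing constant
    return {s: u / z for s, u in zip(states, unnormalized)}


if __name__ == "__main__":
    # Hypothetical two-unit net: one symmetric weight of 2, biases of -1.
    dist = stationary_distribution([[0.0, 2.0], [2.0, 0.0]], [-1.0, -1.0])
    for s, p in dist.items():
        print(s, round(p, 4))
```

In this example the two configurations with energy zero, (0, 0) and (1, 1), are each e times more probable than the two configurations with energy one.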

85
00:06:50,000 --> 00:06:55,405
A nice intuitive way to think about
thermal equilibrium is to imagine a huge

86
00:06:55,405 --> 00:07:00,810
ensemble of identical systems that all
have exactly the same energy function.

87
00:07:00,810 --> 00:07:06,356
So, imagine a very large number of
stochastic Hopfield nets all with the same

88
00:07:06,356 --> 00:07:09,725
weights.
Now, in that huge ensemble, we can define

89
00:07:09,725 --> 00:07:15,411
the probability of a configuration as the
fraction of the systems that are in that

90
00:07:15,411 --> 00:07:19,343
configuration.
So, now we can understand what's happening

91
00:07:19,343 --> 00:07:25,452
as we approach thermal equilibrium.
We can start with any distribution we like

92
00:07:25,452 --> 00:07:29,501
over all these identical systems. We could
make them all be in the same

93
00:07:29,501 --> 00:07:33,550
configuration. So, that's the distribution
with a probability of one on one

94
00:07:33,550 --> 00:07:37,941
configuration, and zero on everything
else. Or we could start them off with an

95
00:07:37,941 --> 00:07:41,078
equal number of systems in each possible
configuration.

96
00:07:41,078 --> 00:07:45,514
So that's a uniform distribution.
And then, we're going to keep applying our

97
00:07:45,514 --> 00:07:49,247
stochastic update rule,
which, in the case of a stochastic

98
00:07:49,247 --> 00:07:53,373
Hopfield net, would mean
you pick a unit, and you look at its

99
00:07:53,373 --> 00:07:56,713
energy gap.
And you make a random decision based on

100
00:07:56,713 --> 00:08:00,578
that energy gap about whether to turn it
on or turn it off.

101
00:08:00,578 --> 00:08:03,460
Then, you go and pick another unit, and so
on.

102
00:08:03,880 --> 00:08:10,001
We keep applying that stochastic rule.
And after we've run the systems stochastically

103
00:08:10,001 --> 00:08:13,499
in this way,
we may eventually reach a situation where

104
00:08:13,499 --> 00:08:17,840
the fraction of the systems in each
configuration remains constant.

105
00:08:17,840 --> 00:08:22,051
In fact, that's what will happen if we
have symmetric connections.

106
00:08:22,051 --> 00:08:26,975
That's the stationary distribution that
physicists call thermal equilibrium.
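
The ensemble picture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a hypothetical two-unit net with a symmetric weight of 2 and biases of -1, a temperature of one, a finite ensemble of 5000 copies, and a fixed random seed. Every copy starts in the same configuration, each update picks one unit at random and resamples it from its energy gap, and the resulting fractions are compared with the e to the minus energy distribution.

```python
import math
import random
from collections import Counter

W = 2.0            # hypothetical symmetric weight between the two units
B = (-1.0, -1.0)   # hypothetical biases
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def energy(s):
    """Energy of a two-unit configuration (assumed Hopfield form)."""
    return -(s[0] * B[0] + s[1] * B[1] + s[0] * s[1] * W)


def energy_gap(s, i):
    """E(unit i off) - E(unit i on), with the other unit held fixed."""
    off, on = list(s), list(s)
    off[i], on[i] = 0, 1
    return energy(off) - energy(on)


def update(s, rng):
    """Pick a unit at random and make a biased random decision for it."""
    i = rng.randrange(2)
    p = 1.0 / (1.0 + math.exp(-energy_gap(s, i)))
    new = list(s)
    new[i] = 1 if rng.random() < p else 0
    return tuple(new)


def run_ensemble(n_systems=5000, n_updates=40, seed=0):
    """Fraction of systems in each configuration after n_updates."""
    rng = random.Random(seed)
    systems = [(1, 0)] * n_systems  # all copies start identical
    for _ in range(n_updates):
        systems = [update(s, rng) for s in systems]
    counts = Counter(systems)
    return {s: counts[s] / n_systems for s in STATES}


def boltzmann():
    """Target distribution: proportional to exp(-energy)."""
    z = sum(math.exp(-energy(s)) for s in STATES)
    return {s: math.exp(-energy(s)) / z for s in STATES}


if __name__ == "__main__":
    fractions, target = run_ensemble(), boltzmann()
    for s in STATES:
        print(s, round(fractions[s], 3), round(target[s], 3))
```

With a finite ensemble the fractions only approximate the stationary distribution, so small sampling fluctuations remain even after the distribution has settled.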

107
00:08:26,975 --> 00:08:30,214
Any given system keeps changing its
configuration.

108
00:08:30,214 --> 00:08:34,296
We apply the update rule,
and the states of its units will keep

109
00:08:34,296 --> 00:08:39,226
flipping between zero and one.
But, the fraction of systems in any

110
00:08:39,226 --> 00:08:45,098
particular configuration doesn't change.
And that's because we have many, many more

111
00:08:45,098 --> 00:08:51,000
systems than we have configurations.
So, here's an analogy just to help with

112
00:08:51,000 --> 00:08:55,443
the concept.
Imagine a very large casino in Las Vegas

113
00:08:55,443 --> 00:09:00,416
with lots of card dealers. And, in fact,
we have many more than 52 factorial card

114
00:09:00,416 --> 00:09:05,848
dealers. We start with all the card packs
in the standard order that they come from

115
00:09:05,848 --> 00:09:11,149
the manufacturer. Let's suppose that has
the ace of spades, and the king of spades,

116
00:09:11,149 --> 00:09:15,688
and the queen of spades.
And then, the dealers all start shuffling.

117
00:09:15,688 --> 00:09:20,930
And they do random shuffles, they don't do
fancy shuffles that bring them back to the

118
00:09:20,930 --> 00:09:24,569
same order again.
After a few shuffles, there's still a good

119
00:09:24,569 --> 00:09:29,502
chance that the king of spades will be
next to the queen of spades in any given

120
00:09:29,502 --> 00:09:32,401
pack.
So, the packs have not yet forgotten where

121
00:09:32,401 --> 00:09:35,731
they started.
Their initial order is still influencing

122
00:09:35,731 --> 00:09:39,184
their current order.
If we keep shuffling, eventually the

123
00:09:39,184 --> 00:09:43,622
initial order will be irrelevant.
The packs will have forgotten where they

124
00:09:43,622 --> 00:09:46,377
started.
And, in fact, in this example, there will

125
00:09:46,377 --> 00:09:50,596
be an equal number of packs in each of the
52 factorial possible orders.

126
00:09:50,596 --> 00:09:53,410
Once this has happened, if we carry on
shuffling,

127
00:09:53,410 --> 00:09:58,126
there'll still be an equal number of packs
in each of the 52 factorial orders.
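
Since 52 factorial is far too large to enumerate, here is a scaled-down sketch of the same idea with a hypothetical three-card pack, which has only six possible orders. It tracks the exact fraction of packs in each order rather than simulating individual dealers, and it assumes one particular random shuffle (move a uniformly chosen card to the top), which is a stand-in rather than anything specified in the lecture.

```python
from itertools import permutations

CARDS = ("A", "K", "Q")              # hypothetical manufacturer order
ORDERS = list(permutations(CARDS))   # all 3! = 6 possible orders


def shuffle_once(dist):
    """Apply one random-to-top shuffle to a distribution over orders."""
    new = {order: 0.0 for order in ORDERS}
    for order, p in dist.items():
        for k in range(len(order)):
            # Move the card at position k to the top of the pack.
            moved = (order[k],) + order[:k] + order[k + 1:]
            new[moved] += p / len(order)
    return new


def after_shuffles(n):
    """Fractions of packs in each order after n shuffles from CARDS."""
    dist = {order: 0.0 for order in ORDERS}
    dist[CARDS] = 1.0  # every pack starts in the manufacturer order
    for _ in range(n):
        dist = shuffle_once(dist)
    return dist


if __name__ == "__main__":
    for n in (0, 1, 3, 30):
        print(n, [round(p, 4) for p in after_shuffles(n).values()])
```

After one shuffle the fractions still clearly remember the starting order; after many shuffles every order holds one sixth of the packs, and further shuffling leaves those fractions unchanged.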

128
00:09:58,126 --> 00:10:02,419
That's why it's called equilibrium.
It's because the fraction in any one

129
00:10:02,419 --> 00:10:06,531
configuration doesn't change,
even though the individual systems are

130
00:10:06,531 --> 00:10:09,917
still changing.
The thing that's wrong with this analogy

131
00:10:09,917 --> 00:10:14,634
is that once we've reached equilibrium here,
all configurations have equal energy.

132
00:10:14,634 --> 00:10:17,173
And so, they all have the same
probability.

133
00:10:17,173 --> 00:10:21,708
In general, we're interested in reaching
equilibrium for systems where some

134
00:10:21,708 --> 00:10:24,430
configurations have lower energy than
others.