1
00:00:03,870 --> 00:00:08,296
All right. Now that we know what mean field is and we've derived the formulas,

2
00:00:08,296 --> 00:00:09,730
let's see an example.

3
00:00:09,730 --> 00:00:11,780
It is called an Ising model.

4
00:00:11,780 --> 00:00:15,315
This model is widely used in physics.

5
00:00:15,315 --> 00:00:17,095
So we have a model,

6
00:00:17,095 --> 00:00:20,165
that is a two dimensional lattice.

7
00:00:20,165 --> 00:00:27,285
Its elements are running variables that can take value of -1 or +1.

8
00:00:27,285 --> 00:00:32,670
We also need a functional neighbor that returns a set of neighboring elements.

9
00:00:32,670 --> 00:00:36,695
For example, here, that will have three neighbors.

10
00:00:36,695 --> 00:00:41,595
We define the joint probability over all these variables in the following way.

11
00:00:41,595 --> 00:00:48,245
It would be proportional to an exponent of 1/2 times J,

12
00:00:48,245 --> 00:00:50,431
that is a parameter form model,

13
00:00:50,431 --> 00:00:53,390
times the sum over all edges,

14
00:00:53,390 --> 00:00:56,390
and the product of the two random variables.

15
00:00:56,390 --> 00:01:00,475
If the neighboring values have the same sign,

16
00:01:00,475 --> 00:01:03,735
it will contribute one to the total sum.

17
00:01:03,735 --> 00:01:05,505
And if the product is -1,

18
00:01:05,505 --> 00:01:08,515
it will contribute -1 to the total sum.

19
00:01:08,515 --> 00:01:17,565
And also we have another term that is sum over all nodes are the letters, bi times yi.

20
00:01:17,565 --> 00:01:19,875
This is called an external field.

21
00:01:19,875 --> 00:01:25,975
So we'd know this function exponent of some terms as phi y,

22
00:01:25,975 --> 00:01:28,900
and we'll see what we can do with this model.

23
00:01:28,900 --> 00:01:30,257
But first of all,

24
00:01:30,257 --> 00:01:32,285
let's interpret it somehow.

25
00:01:32,285 --> 00:01:34,678
If J is greater than one,

26
00:01:34,678 --> 00:01:40,220
then the values yi will tend to have the same sign.

27
00:01:40,220 --> 00:01:42,214
This is the case for ferromagnetics.

28
00:01:42,214 --> 00:01:47,275
And yi's can be interpreted as spins of atoms.

29
00:01:47,275 --> 00:01:49,690
If J is less than one,

30
00:01:49,690 --> 00:01:53,575
the neighboring spins will tend to anti-align,

31
00:01:53,575 --> 00:01:55,660
and this is shown on the right.

32
00:01:55,660 --> 00:02:00,825
So, we have defined the distribution up to a normalization constant.

33
00:02:00,825 --> 00:02:06,325
But to compute it, we'll have to sum up over all possible states.

34
00:02:06,325 --> 00:02:13,970
And there are two power and square terms and it seems impossible for large lattices.

35
00:02:13,970 --> 00:02:22,405
So let's try to apply a meaningful approximation to compute the p or y approximately.

36
00:02:22,405 --> 00:02:25,920
So we'll try to approximate this by product of

37
00:02:25,920 --> 00:02:31,310
some terms and each term would be a distribution over each variable.

38
00:02:31,310 --> 00:02:34,642
So here's an example.

39
00:02:34,642 --> 00:02:37,449
We have four nodes here,

40
00:02:37,449 --> 00:02:39,770
and the central node i.

41
00:02:39,770 --> 00:02:43,890
The external field parameterized by branches

42
00:02:43,890 --> 00:02:48,390
B is shown here as the yellow and the green sides.

43
00:02:48,390 --> 00:02:49,770
On the yellow side,

44
00:02:49,770 --> 00:02:52,343
there is a strong negative field,

45
00:02:52,343 --> 00:02:55,580
and on the green side there is strong positive field.

46
00:02:55,580 --> 00:02:57,028
So on the positive field,

47
00:02:57,028 --> 00:03:04,080
the values of the corresponding nodes would try to have both positive sign.

48
00:03:04,080 --> 00:03:08,985
In negative field, the nodes would try to have a negative sign.

49
00:03:08,985 --> 00:03:11,565
So actually in this case,

50
00:03:11,565 --> 00:03:13,485
the left node would say something like,

51
00:03:13,485 --> 00:03:16,530
I feel the negative field on me.

52
00:03:16,530 --> 00:03:21,645
And the other three nodes will say something like,

53
00:03:21,645 --> 00:03:23,850
they feel the positive field.

54
00:03:23,850 --> 00:03:26,040
And so they have some information and

55
00:03:26,040 --> 00:03:29,665
they will try to contribute it to the current node i.

56
00:03:29,665 --> 00:03:34,683
And this would be done using mean field formula.

57
00:03:34,683 --> 00:03:37,995
So here is our joint distribution.

58
00:03:37,995 --> 00:03:40,705
And here's a small picture of our model.

59
00:03:40,705 --> 00:03:46,740
So we're interested now in deriving the update formula for the mean field approximation.

60
00:03:46,740 --> 00:03:47,895
So what we'd like to do,

61
00:03:47,895 --> 00:03:55,778
is to find the q of Yk.

62
00:03:55,778 --> 00:04:00,440
We'll do this using mean field approximation.

63
00:04:00,440 --> 00:04:06,965
So, the idea is that the neighboring points already know some information,

64
00:04:06,965 --> 00:04:09,105
about the external fields B.

65
00:04:09,105 --> 00:04:13,771
For example, this says that there is external field of the sign plus,

66
00:04:13,771 --> 00:04:17,105
this also plus, here minus and minus.

67
00:04:17,105 --> 00:04:21,918
So it has some information and they want to propagate it to our node.

68
00:04:21,918 --> 00:04:24,885
And we'll see how it is done.

69
00:04:24,885 --> 00:04:29,440
So the formula that we derived in the previous video looks as follows.

70
00:04:29,440 --> 00:04:35,805
We know that the logarithm of qk,

71
00:04:35,805 --> 00:04:38,505
let me write it down an index K here,

72
00:04:38,505 --> 00:04:44,065
equals to the expectation over all variables except for the K,

73
00:04:44,065 --> 00:04:48,883
and we write it down as q minus K,

74
00:04:48,883 --> 00:04:53,715
the logarithm of the actual preview that we are trying to approximate.

75
00:04:53,715 --> 00:04:59,555
So it will be p of y, plus some constant.

76
00:04:59,555 --> 00:05:04,725
Notice here that we didn't write down the full distribution,

77
00:05:04,725 --> 00:05:08,098
since we do not know the minimization constant.

78
00:05:08,098 --> 00:05:11,610
However, here we can take it out into the constant.

79
00:05:11,610 --> 00:05:17,505
So now we can omit the terms that do not depend on Yk.

80
00:05:17,505 --> 00:05:19,070
And if we write it down carefully,

81
00:05:19,070 --> 00:05:21,268
we'll get the following formula.

82
00:05:21,268 --> 00:05:23,420
So have that expectation,

83
00:05:23,420 --> 00:05:27,425
over all terms except for the K's.

84
00:05:27,425 --> 00:05:29,150
So the overall of these terms.

85
00:05:29,150 --> 00:05:33,520
So we can omit the exponent and have it on this thing.

86
00:05:33,520 --> 00:05:35,431
So let me write it down.

87
00:05:35,431 --> 00:05:42,380
It's like this, J sum over J that are

88
00:05:42,380 --> 00:05:50,558
neighbors of K, Yk,Yj.

89
00:05:50,558 --> 00:05:54,605
So I omitted one half here since in this formula,

90
00:05:54,605 --> 00:05:58,370
we use each edge twice.

91
00:05:58,370 --> 00:06:01,660
And so here, we only want to write down once.

92
00:06:01,660 --> 00:06:08,845
And plus Bk, Yk.

93
00:06:08,845 --> 00:06:15,850
So this goes under the exponent and goes on constant.

94
00:06:15,850 --> 00:06:17,980
All right.

95
00:06:17,980 --> 00:06:24,240
We can tell you the expectation and put it under the summation.

96
00:06:24,240 --> 00:06:29,970
We'll get J times the sum over J that are

97
00:06:29,970 --> 00:06:37,171
neighboring points for the current note.

98
00:06:37,171 --> 00:06:41,120
We take expectation over all variables except for the

99
00:06:41,120 --> 00:06:45,710
K. So Yk is actually constant with respect to the integration,

100
00:06:45,710 --> 00:06:50,165
so we can write it down here and take the expectation Yj.

101
00:06:50,165 --> 00:06:56,985
And this term is simply a constant with respect to integration here.

102
00:06:56,985 --> 00:07:06,420
So we'll have just Bk, Yk plus on constant.

103
00:07:06,420 --> 00:07:12,750
So let's note the expected value of Yj as mu J.

104
00:07:12,750 --> 00:07:20,820
It's just that mean value of the J's node.

105
00:07:20,820 --> 00:07:26,350
And actually, the information that the node obtained from the other nodes or

106
00:07:26,350 --> 00:07:32,765
from the external field in this point is contained in the value of mu J.

107
00:07:32,765 --> 00:07:37,280
So, this equals to,

108
00:07:37,280 --> 00:07:45,105
we can actually group the terms corresponding to Yk and get the following function.

109
00:07:45,105 --> 00:07:55,980
This would be Yk times the J sum over mu J under the neighbors.

110
00:07:56,540 --> 00:08:05,400
Plus BK. Since we don't want to write this down multiple times,

111
00:08:05,400 --> 00:08:09,415
let's say that this thing equals to some constant M,

112
00:08:09,415 --> 00:08:16,190
and find the close constant.

113
00:08:16,190 --> 00:08:22,971
So now, I want to estimate the distribution QK but for now it's only up to a constant.

114
00:08:22,971 --> 00:08:26,635
Let's take the exponent of both parts,

115
00:08:26,635 --> 00:08:31,840
and also remember that the interval of QK should be equal to one.

116
00:08:31,840 --> 00:08:36,970
In this case, it means that Q of plus one,

117
00:08:36,970 --> 00:08:43,300
plus Q of minus one should be equal to one.

118
00:08:43,300 --> 00:08:48,245
We can plug in this formula here.

119
00:08:48,245 --> 00:08:52,320
I will have exponent of,

120
00:08:52,320 --> 00:08:59,070
so here Yk equals to plus one exponent of M times the exponent of the constant,

121
00:08:59,070 --> 00:09:00,680
let's right it down C,

122
00:09:00,680 --> 00:09:04,810
plus again the same constant C,

123
00:09:04,810 --> 00:09:10,520
and the E to power of minus M, and it should be equal to one.

124
00:09:10,520 --> 00:09:19,330
C here should be equal to one over E to the power of M,

125
00:09:19,330 --> 00:09:27,595
plus E to the power of minus M. This is the value for the constant.

126
00:09:27,595 --> 00:09:31,023
And finally, we can compute the probabilities.

127
00:09:31,023 --> 00:09:32,960
So the probability that Q equals one,

128
00:09:32,960 --> 00:09:38,450
equals E to the power of M over this constant C,

129
00:09:38,450 --> 00:09:45,525
each with the power of M, plus E to the power of minus M. What is this function?

130
00:09:45,525 --> 00:09:51,605
What do you think? We can multiply it by each with the power of minus M,

131
00:09:51,605 --> 00:09:57,235
we'll have one over one plus E with the power of the minus 2M,

132
00:09:57,235 --> 00:10:03,785
and actually equals the sigmoid function of 2M.

133
00:10:03,785 --> 00:10:08,780
All right, so now we can update the distributions,

134
00:10:08,780 --> 00:10:12,305
however, we need to do one more thing.

135
00:10:12,305 --> 00:10:17,270
Notice here that we used the value of μj.

136
00:10:17,270 --> 00:10:22,330
For the other nodes to be able to use these constants μj,

137
00:10:22,330 --> 00:10:26,260
we need to update the μj for our node.

138
00:10:26,260 --> 00:10:33,304
We need to compute the μK. This is an expectation of Yk.

139
00:10:33,304 --> 00:10:37,880
It is simply Q at the position plus one,

140
00:10:37,880 --> 00:10:40,980
minus Q at the position of minus one.

141
00:10:40,980 --> 00:10:45,655
We can again plug in the similar formulas.

142
00:10:45,655 --> 00:10:50,825
This would be each with the power of M minus E to

143
00:10:50,825 --> 00:10:57,110
the power minus M over the normalization constant.

144
00:10:57,110 --> 00:11:00,414
As you may notice,

145
00:11:00,414 --> 00:11:09,385
this actually equals to the hyperbolic tangent.

146
00:11:09,385 --> 00:11:12,915
Here's our final formula.

147
00:11:12,915 --> 00:11:16,520
Let's again and see how it works.

148
00:11:16,520 --> 00:11:17,694
We iterate our nodes,

149
00:11:17,694 --> 00:11:19,950
we select some node.

150
00:11:19,950 --> 00:11:24,255
We compute the probabilities Q,

151
00:11:24,255 --> 00:11:28,290
and then update the value of μk.

152
00:11:28,290 --> 00:11:34,809
And while we update the probabilities Q, we also use the values μj.

153
00:11:34,809 --> 00:11:38,170
Also actually, here it is QK,

154
00:11:38,170 --> 00:11:45,870
which is actually true since we're estimating the values for the μK.

155
00:11:45,870 --> 00:11:49,860
Now that we've derived an update formula,

156
00:11:49,860 --> 00:11:51,954
let's see how this one will work for different values of J.

157
00:11:51,954 --> 00:11:54,380
Here's our setup.

158
00:11:54,380 --> 00:11:56,595
We have two areas,

159
00:11:56,595 --> 00:12:00,025
the white one corresponds to the positive external field,

160
00:12:00,025 --> 00:12:03,360
and the black one corresponds to the negative external field.

161
00:12:03,360 --> 00:12:05,025
If J is 0,

162
00:12:05,025 --> 00:12:09,125
then with probability one on the white area,

163
00:12:09,125 --> 00:12:12,330
the spins would tend to be plus one.

164
00:12:12,330 --> 00:12:15,615
On the black area,

165
00:12:15,615 --> 00:12:19,715
the probability would be one for having minus one.

166
00:12:19,715 --> 00:12:24,605
And everywhere else we'll have the probability 0.5 for each possible state.

167
00:12:24,605 --> 00:12:29,600
It happens because there is no interaction between neighboring points since J is 0.

168
00:12:29,600 --> 00:12:32,085
You will have the negative J.

169
00:12:32,085 --> 00:12:35,200
We'll have a chess-like field.

170
00:12:35,200 --> 00:12:41,960
The neighboring points would try to have opposite signs,

171
00:12:41,960 --> 00:12:45,775
there will be blacks and whites nearby.

172
00:12:45,775 --> 00:12:49,320
As we go further from external field,

173
00:12:49,320 --> 00:12:51,870
the interaction is slower,

174
00:12:51,870 --> 00:12:54,640
and so when we're really far away from the field,

175
00:12:54,640 --> 00:12:57,945
the probability is actually nearly 0.5,

176
00:12:57,945 --> 00:13:03,630
which will indicate that there can be either plus one or minus one.

177
00:13:03,630 --> 00:13:08,735
All right. The final example is a strong positive J.

178
00:13:08,735 --> 00:13:12,315
In this case, we'll get a picture like this,

179
00:13:12,315 --> 00:13:15,605
one part would be white that means that we'll

180
00:13:15,605 --> 00:13:20,770
have plus one with probability one on the left upper corner,

181
00:13:20,770 --> 00:13:25,100
and everywhere else we'll have minus one with probability one.

182
00:13:25,100 --> 00:13:28,733
So actually, this situation should have be symmetric.

183
00:13:28,733 --> 00:13:35,205
Why didn't we get the opposite picture when there would be a right lower corner,

184
00:13:35,205 --> 00:13:39,270
black, and other things would be white?

185
00:13:39,270 --> 00:13:42,615
This is actually a property of the KL diversions.

186
00:13:42,615 --> 00:13:48,850
Here I have a [inaudible] the bimodal distribution,

187
00:13:48,850 --> 00:13:53,334
and I try to, approximate it by fitting the KL diversions.

188
00:13:53,334 --> 00:13:56,855
There could be two possible cases.

189
00:13:56,855 --> 00:14:01,155
One is left is that the KL diversions would fit one node,

190
00:14:01,155 --> 00:14:08,090
and on the second one is that we fit something in the middle.

191
00:14:08,090 --> 00:14:11,280
What do you think would happen when minimize the KL diversions

192
00:14:11,280 --> 00:14:15,260
between the bayesian distribution and [inaudible].

193
00:14:15,260 --> 00:14:20,445
Let's first of all see what are the properties of those two things.

194
00:14:20,445 --> 00:14:22,290
The second one captures the statistics,

195
00:14:22,290 --> 00:14:25,770
so it would have for example the correct mean.

196
00:14:25,770 --> 00:14:27,450
However, the first one has

197
00:14:27,450 --> 00:14:31,955
the very important property that the mode has high probability.

198
00:14:31,955 --> 00:14:34,140
In the second example,

199
00:14:34,140 --> 00:14:36,030
the probability of mode is really low.

200
00:14:36,030 --> 00:14:39,795
It seems that the mode is actually impossible, and so,

201
00:14:39,795 --> 00:14:42,655
for many practical cases,

202
00:14:42,655 --> 00:14:45,883
the first fit would be nicer and actually this is the case.

203
00:14:45,883 --> 00:14:48,985
Let's see why it happens. All right.

204
00:14:48,985 --> 00:14:51,270
So here's our KL diversions.

205
00:14:51,270 --> 00:14:55,200
It is an integral of Q of Z times log of the ratio.

206
00:14:55,200 --> 00:14:58,260
Let's see what happens if we assign

207
00:14:58,260 --> 00:15:03,160
non-zero probability to the Q and zero probability to the P-star.

208
00:15:03,160 --> 00:15:08,230
In this case, the KL diversions would have a value of plus infinity.

209
00:15:08,230 --> 00:15:12,630
And so, the KL divergence would try to avoid

210
00:15:12,630 --> 00:15:17,480
giving non-zero probability to the regions that are impossible from the first tier.

211
00:15:17,480 --> 00:15:23,575
It is called a zero avoiding property of KL divergence,

212
00:15:23,575 --> 00:15:27,250
and it turns out to be useful in many cases.