1 00:00:00,620 --> 00:00:03,610 As part of this course by deeplearning.ai, 2 00:00:03,610 --> 00:00:07,590 I hope to not just teach you the technical ideas in deep learning, but 3 00:00:07,590 --> 00:00:11,658 also introduce you to some of the people, some of the heroes in deep learning. 4 00:00:11,658 --> 00:00:13,160 The people that invented so 5 00:00:13,160 --> 00:00:17,700 many of these ideas that you learn about in this course or in this specialization. 6 00:00:17,700 --> 00:00:21,420 In these videos, I hope to also ask these leaders of deep learning 7 00:00:21,420 --> 00:00:24,990 to give you career advice for how you can break into deep learning, for 8 00:00:24,990 --> 00:00:27,805 how you can do research or find a job in deep learning. 9 00:00:27,805 --> 00:00:30,156 As the first of this interview series, 10 00:00:30,156 --> 00:00:34,228 I am delighted to present to you an interview with Geoffrey Hinton. 11 00:00:38,427 --> 00:00:44,150 Welcome Geoff, and thank you for doing this interview with deeplearning.ai. 12 00:00:44,150 --> 00:00:46,550 >> Thank you for inviting me. 13 00:00:46,550 --> 00:00:50,088 >> I think that at this point you, more than anyone else on this planet, have 14 00:00:50,088 --> 00:00:52,835 invented so many of the ideas behind deep learning. 15 00:00:52,835 --> 00:00:57,650 And a lot of people have been calling you the godfather of deep learning. 16 00:00:57,650 --> 00:01:01,529 Although it wasn't until we were chatting a few minutes ago that I realized 17 00:01:01,529 --> 00:01:05,600 you think I'm the first one to call you that, which I'm quite happy to have done. 18 00:01:06,780 --> 00:01:11,320 But what I want to ask is, many people know you as a legend, 19 00:01:11,320 --> 00:01:15,030 I want to ask about your personal story behind the legend. 20 00:01:15,030 --> 00:01:19,980 So how did you get involved in, going way back, how did you get involved in AI and 21 00:01:19,980 --> 00:01:21,520 machine learning and neural networks? 22 00:01:22,730 --> 00:01:26,960 >> So when I was at high school, I had a classmate who was always 23 00:01:26,960 --> 00:01:31,220 better than me at everything, he was a brilliant mathematician. 24 00:01:31,220 --> 00:01:37,010 And he came into school one day and said, did you know the brain uses holograms? 25 00:01:38,190 --> 00:01:44,161 And I guess that was about 1966, and I said, sort of, what's a hologram? 26 00:01:44,161 --> 00:01:47,390 And he explained that in a hologram you can chop off half of it, and 27 00:01:47,390 --> 00:01:49,730 you still get the whole picture. 28 00:01:49,730 --> 00:01:53,466 And that memories in the brain might be distributed over the whole brain. 29 00:01:53,466 --> 00:01:56,022 And so I guess he'd read about Lashley's experiments, 30 00:01:56,022 --> 00:01:57,939 where you chop off bits of a rat's brain and 31 00:01:57,939 --> 00:02:01,740 discover that it's very hard to find one bit where it stores one particular memory. 32 00:02:04,411 --> 00:02:08,920 So that's what first got me interested in how does the brain store memories. 33 00:02:10,180 --> 00:02:12,220 And then when I went to university, 34 00:02:12,220 --> 00:02:15,130 I started off studying physiology and physics. 35 00:02:16,400 --> 00:02:17,731 I think when I was at Cambridge, 36 00:02:17,731 --> 00:02:20,260 I was the only undergraduate doing physiology and physics.
37 00:02:21,888 --> 00:02:25,270 And then I gave up on that and 38 00:02:25,270 --> 00:02:29,170 tried to do philosophy, because I thought that might give me more insight. 39 00:02:29,170 --> 00:02:32,780 But that seemed to me actually 40 00:02:32,780 --> 00:02:37,130 lacking in ways of distinguishing when they said something false. 41 00:02:37,130 --> 00:02:39,420 And so then I switched to psychology. 42 00:02:41,988 --> 00:02:45,920 And in psychology they had very, very simple theories, and it seemed to me 43 00:02:45,920 --> 00:02:49,620 it was sort of hopelessly inadequate for explaining what the brain was doing. 44 00:02:49,620 --> 00:02:52,737 So then I took some time off and became a carpenter. 45 00:02:52,737 --> 00:02:57,169 And then I decided that I'd try AI, and went off to Edinburgh, 46 00:02:57,169 --> 00:02:59,580 to study AI with Longuet-Higgins. 47 00:02:59,580 --> 00:03:02,662 And he had done very nice work on neural networks, and 48 00:03:02,662 --> 00:03:07,830 he'd just given up on neural networks, and been very impressed by Winograd's thesis. 49 00:03:07,830 --> 00:03:11,460 So when I arrived he thought I was kind of doing this old-fashioned stuff, and 50 00:03:11,460 --> 00:03:14,210 I ought to start on symbolic AI. 51 00:03:14,210 --> 00:03:18,210 And we had a lot of fights about that, but I just kept on doing what I believed in. 52 00:03:18,210 --> 00:03:21,138 >> And then what? 53 00:03:21,138 --> 00:03:28,033 >> I eventually got a PhD in AI, and then I couldn't get a job in Britain. 54 00:03:28,033 --> 00:03:30,979 But I saw this very nice advertisement for 55 00:03:30,979 --> 00:03:36,070 Sloan Fellowships in California, and I managed to get one of those. 56 00:03:36,070 --> 00:03:40,625 And I went to California, and everything was different there. 57 00:03:40,625 --> 00:03:46,685 So in Britain, neural nets were regarded as kind of silly, 58 00:03:46,685 --> 00:03:50,272 and in California, Don Norman and 59 00:03:50,272 --> 00:03:56,640 David Rumelhart were very open to ideas about neural nets. 60 00:03:56,640 --> 00:04:00,720 It was the first time I'd been somewhere where thinking about how the brain works, 61 00:04:00,720 --> 00:04:03,290 and thinking about how that might relate to psychology, 62 00:04:03,290 --> 00:04:05,650 was seen as a very positive thing. 63 00:04:05,650 --> 00:04:06,936 And it was a lot of fun there, 64 00:04:06,936 --> 00:04:09,792 in particular collaborating with David Rumelhart was great. 65 00:04:09,792 --> 00:04:12,968 >> I see, great. So this was when you were at UCSD, and 66 00:04:12,968 --> 00:04:16,177 you and Rumelhart around what, 1982, 67 00:04:16,177 --> 00:04:20,182 wound up writing the seminal backprop paper, right? 68 00:04:20,182 --> 00:04:23,292 >> Actually, it was more complicated than that. 69 00:04:23,292 --> 00:04:24,796 >> What happened? 70 00:04:24,796 --> 00:04:28,214 >> In, I think, early 1982, 71 00:04:28,214 --> 00:04:32,900 David Rumelhart and me, and Ron Williams, 72 00:04:32,900 --> 00:04:37,967 between us developed the backprop algorithm, 73 00:04:37,967 --> 00:04:42,291 it was mainly David Rumelhart's idea. 74 00:04:42,291 --> 00:04:46,390 We discovered later that many other people had invented it. 75 00:04:46,390 --> 00:04:52,798 David Parker had invented it, probably after us, but before we'd published. 76 00:04:52,798 --> 00:04:56,425 Paul Werbos had published it already quite a few years earlier, but 77 00:04:56,425 --> 00:04:58,860 nobody paid it much attention.
78 00:04:58,860 --> 00:05:01,923 And there were other people who'd developed very similar algorithms, 79 00:05:01,923 --> 00:05:04,340 it's not clear what's meant by backprop. 80 00:05:04,340 --> 00:05:08,055 But using the chain rule to get derivatives was not a novel idea. 81 00:05:08,055 --> 00:05:12,484 >> I see, why do you think it was your paper that helped so 82 00:05:12,484 --> 00:05:15,940 much the community latch on to backprop? 83 00:05:15,940 --> 00:05:20,540 It feels like your paper marked an inflection point in the acceptance of this 84 00:05:20,540 --> 00:05:22,934 algorithm, where the community accepted it. 85 00:05:22,934 --> 00:05:26,675 >> So we managed to get a paper into Nature in 1986. 86 00:05:26,675 --> 00:05:30,580 And I did quite a lot of political work to get the paper accepted. 87 00:05:30,580 --> 00:05:34,622 I figured out that one of the referees was probably going to be Stuart Sutherland, 88 00:05:34,622 --> 00:05:36,992 who was a well-known psychologist in Britain. 89 00:05:36,992 --> 00:05:38,815 And I went to talk to him for a long time, and 90 00:05:38,815 --> 00:05:41,480 explained to him exactly what was going on. 91 00:05:41,480 --> 00:05:44,140 And he was very impressed by the fact 92 00:05:44,140 --> 00:05:48,970 that we showed that backprop could learn representations for words. 93 00:05:48,970 --> 00:05:52,490 And you could look at those representations, which are little vectors, 94 00:05:52,490 --> 00:05:55,950 and you could understand the meaning of the individual features. 95 00:05:55,950 --> 00:06:01,600 So we actually trained it on little triples of words about family trees, 96 00:06:01,600 --> 00:06:06,420 like Mary has mother Victoria. 97 00:06:06,420 --> 00:06:11,550 And you'd give it the first two words, and it would have to predict the last word. 98 00:06:11,550 --> 00:06:12,970 And after you trained it, 99 00:06:12,970 --> 00:06:17,780 you could see all sorts of features in the representations of the individual words. 100 00:06:17,780 --> 00:06:19,950 Like the nationality of the person there, 101 00:06:19,950 --> 00:06:25,180 what generation they were, which branch of the family tree they were in, and so on. 102 00:06:25,180 --> 00:06:27,680 That was what made Stuart Sutherland really impressed with it, and 103 00:06:27,680 --> 00:06:29,666 I think that's why the paper got accepted. 104 00:06:29,666 --> 00:06:33,905 >> Very early word embeddings, and you're already seeing learned 105 00:06:33,905 --> 00:06:38,390 features of semantic meanings emerge from the training algorithm. 106 00:06:38,390 --> 00:06:44,090 >> Yes, so from a psychologist's point of view, what was interesting was it unified 107 00:06:44,090 --> 00:06:49,740 two completely different strands of ideas about what knowledge was like. 108 00:06:49,740 --> 00:06:53,460 So there was the old psychologist's view that a concept is just a big 109 00:06:53,460 --> 00:06:56,810 bundle of features, and there's lots of evidence for that. 110 00:06:56,810 --> 00:07:02,180 And then there was the AI view of the time, which is a formal structuralist view. 111 00:07:02,180 --> 00:07:06,190 Which was that a concept is how it relates to other concepts. 112 00:07:06,190 --> 00:07:09,820 And to capture a concept, you'd have to do something like a graph structure or 113 00:07:09,820 --> 00:07:11,640 maybe a semantic net.
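[Aside for readers of this transcript: below is a minimal sketch, in Python/NumPy, of the kind of triple-prediction setup Hinton describes above, where the network is given the first two words of a family-tree triple and trained to predict the third, learning a little feature vector for each word along the way. The toy data, the vector sizes, and the single softmax layer are illustrative assumptions; the actual 1986 network had additional hidden layers.]

```python
import numpy as np

# Toy family-tree triples: (person1, relationship, person2).
people = ["mary", "victoria", "james", "colin"]
relations = ["mother", "father", "son"]
triples = [("mary", "mother", "victoria"),
           ("mary", "father", "james"),
           ("victoria", "son", "colin")]

dim = 6                                              # size of each learned feature vector
rng = np.random.default_rng(0)
emb_p = rng.normal(0, 0.1, (len(people), dim))       # person embeddings
emb_r = rng.normal(0, 0.1, (len(relations), dim))    # relation embeddings
W = rng.normal(0, 0.1, (2 * dim, len(people)))       # softmax weights over people
b = np.zeros(len(people))
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(500):
    for p1, rel, p2 in triples:
        i, j, k = people.index(p1), relations.index(rel), people.index(p2)
        x = np.concatenate([emb_p[i], emb_r[j]])     # features of the first two words
        probs = softmax(x @ W + b)                   # predict the third word
        grad_logits = probs.copy()
        grad_logits[k] -= 1.0                        # d(cross-entropy)/d(logits)
        grad_x = W @ grad_logits
        W -= lr * np.outer(x, grad_logits)           # gradient descent on the weights
        b -= lr * grad_logits
        emb_p[i] -= lr * grad_x[:dim]                # and on the embeddings themselves
        emb_r[j] -= lr * grad_x[dim:]

# Each row of emb_p is now a learned feature vector for a person; inspecting its
# components is the 1986-style exercise of reading off features like generation
# or branch of the family tree.
print(np.round(emb_p, 2))
```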
114 00:07:11,640 --> 00:07:15,875 And what this back propagation example showed was, you could give it 115 00:07:15,875 --> 00:07:21,070 the information that would go into a graph structure, or in this case a family tree. 116 00:07:22,080 --> 00:07:26,920 And it could convert that information into features in such a way that it could then 117 00:07:26,920 --> 00:07:33,470 use the features to derive new consistent information, i.e., generalize. 118 00:07:33,470 --> 00:07:38,438 But the crucial thing was this to and fro between the graphical representation or 119 00:07:38,438 --> 00:07:43,000 the tree structured representation of the family tree, and 120 00:07:43,000 --> 00:07:46,715 a representation of the people as big feature vectors. 121 00:07:46,715 --> 00:07:50,873 And in fact, from the graph-like representation you could get feature 122 00:07:50,873 --> 00:07:51,469 vectors. 123 00:07:51,469 --> 00:07:54,995 And from the feature vectors, you could get more of the graph-like representation. 124 00:07:54,995 --> 00:07:57,730 >> So this is 1986? 125 00:07:57,730 --> 00:08:02,430 >> In the early 90s, Bengio showed that you can actually take real data, 126 00:08:02,430 --> 00:08:07,420 you could take English text, and apply the same techniques there, and 127 00:08:07,420 --> 00:08:13,980 get embeddings for real words from English text, and that impressed people a lot. 128 00:08:13,980 --> 00:08:18,682 >> I guess recently we've been talking a lot about how it's fast computers, like GPUs and 129 00:08:18,682 --> 00:08:21,750 supercomputers, that's driving deep learning. 130 00:08:21,750 --> 00:08:26,376 I didn't realize that back between 1986 and the early 90s, it sounds like between 131 00:08:26,376 --> 00:08:29,570 you and Bengio there were already the beginnings of this trend. 132 00:08:30,600 --> 00:08:32,630 >> Yes, it was a huge advance. 133 00:08:32,630 --> 00:08:41,440 In 1986, I was using a Lisp machine which was less than a tenth of a megaflop. 134 00:08:41,440 --> 00:08:47,720 And by about 1993 or thereabouts, people were seeing ten megaflops. 135 00:08:47,720 --> 00:08:49,600 >> I see. >> So there was a factor of 100, 136 00:08:49,600 --> 00:08:51,770 and that's the point at which it was easy to use, 137 00:08:51,770 --> 00:08:53,580 because computers were just getting faster. 138 00:08:53,580 --> 00:08:56,960 >> Over the past several decades, you've invented so 139 00:08:56,960 --> 00:08:59,970 many pieces of neural networks and deep learning. 140 00:08:59,970 --> 00:09:02,670 I'm actually curious, of all of the things you've invented, 141 00:09:02,670 --> 00:09:05,050 which are the ones you're still most excited about today? 142 00:09:06,940 --> 00:09:09,590 >> So I think the most beautiful one is the work I did with 143 00:09:09,590 --> 00:09:12,620 Terry Sejnowski on Boltzmann machines. 144 00:09:12,620 --> 00:09:14,500 So we discovered there was this really, 145 00:09:14,500 --> 00:09:18,830 really simple learning algorithm that applied to great big 146 00:09:18,830 --> 00:09:23,550 densely connected nets where you could only see a few of the nodes. 147 00:09:23,550 --> 00:09:27,730 So it would learn hidden representations and it was a very simple algorithm. 148 00:09:27,730 --> 00:09:31,130 And it looked like the kind of thing you should be able to get in a brain because 149 00:09:31,130 --> 00:09:34,210 each synapse only needed to know about the behavior of the two 150 00:09:34,210 --> 00:09:35,940 neurons it was directly connected to.
151 00:09:37,010 --> 00:09:41,230 And the information that was propagated was the same. 152 00:09:41,230 --> 00:09:45,160 There were two different phases, which we called wake and sleep. 153 00:09:45,160 --> 00:09:46,820 But in the two different phases, 154 00:09:46,820 --> 00:09:48,760 you're propagating information in just the same way. 155 00:09:48,760 --> 00:09:52,360 Whereas in something like back propagation, there's a forward pass and 156 00:09:52,360 --> 00:09:54,820 a backward pass, and they work differently. 157 00:09:54,820 --> 00:09:56,379 They're sending different kinds of signals. 158 00:09:58,100 --> 00:10:01,190 So I think that's the most beautiful thing. 159 00:10:01,190 --> 00:10:03,730 And for many years it looked just like a curiosity, 160 00:10:03,730 --> 00:10:05,090 because it looked like it was much too slow. 161 00:10:06,210 --> 00:10:10,420 But then later on, I got rid of a little bit of the beauty by not letting 162 00:10:10,420 --> 00:10:13,730 it settle down, and just using one iteration, in a somewhat simpler net. 163 00:10:13,730 --> 00:10:16,570 And that gave restricted Boltzmann machines, 164 00:10:16,570 --> 00:10:19,430 which actually worked effectively in practice. 165 00:10:19,430 --> 00:10:21,586 So in the Netflix competition, for example, 166 00:10:21,586 --> 00:10:26,170 restricted Boltzmann machines were one of the ingredients of the winning entry. 167 00:10:26,170 --> 00:10:30,210 >> And in fact, a lot of the recent resurgence of neural nets and 168 00:10:30,210 --> 00:10:34,790 deep learning, starting about 2007, was the restricted Boltzmann machine 169 00:10:34,790 --> 00:10:37,710 and deep belief net work that you and your lab did. 170 00:10:38,940 --> 00:10:42,130 >> Yes, so that's another of the pieces of work I'm very happy with, 171 00:10:42,130 --> 00:10:46,290 the idea that you could train a restricted Boltzmann machine, which just 172 00:10:46,290 --> 00:10:51,120 had one layer of hidden features, and you could learn one layer of features. 173 00:10:51,120 --> 00:10:54,850 And then you could treat those features as data and do it again, and 174 00:10:54,850 --> 00:10:57,953 then you could treat the new features you learned as data and do it again, 175 00:10:57,953 --> 00:10:59,570 as many times as you liked. 176 00:10:59,570 --> 00:11:03,060 So that was nice, it worked in practice. 177 00:11:03,060 --> 00:11:08,709 And then Yee-Whye Teh realized that the whole thing could be treated as a single model, 178 00:11:08,709 --> 00:11:11,110 but it was a weird kind of model. 179 00:11:11,110 --> 00:11:15,946 It was a model where at the top you had a restricted Boltzmann machine, but 180 00:11:15,946 --> 00:11:20,626 below that you had a sigmoid belief net, which was something that 181 00:11:20,626 --> 00:11:23,060 had been invented many years earlier. 182 00:11:23,060 --> 00:11:24,620 So it was a directed model and 183 00:11:24,620 --> 00:11:28,651 what we'd managed to come up with by training these restricted Boltzmann 184 00:11:28,651 --> 00:11:32,760 machines was an efficient way of doing inference in sigmoid belief nets. 185 00:11:33,830 --> 00:11:36,870 So, around that time, 186 00:11:36,870 --> 00:11:41,270 there were people doing neural nets, who would use densely connected nets, but 187 00:11:41,270 --> 00:11:45,500 didn't have any good ways of doing probabilistic inference in them.
188 00:11:45,500 --> 00:11:50,050 And you had people doing graphical models, unlike my children, 189 00:11:50,050 --> 00:11:55,603 who could do inference properly, but only in sparsely connected nets. 190 00:11:55,603 --> 00:12:01,140 And what we managed to show was the way of learning these deep 191 00:12:01,140 --> 00:12:06,280 belief nets so that there's an approximate form of inference that's very fast, 192 00:12:06,280 --> 00:12:10,578 it just happens in a single forward pass, and that was a very beautiful result. 193 00:12:10,578 --> 00:12:14,890 And you could guarantee that each time you learned that extra layer of features 194 00:12:16,010 --> 00:12:19,980 there was a bound, each time you learned a new layer, you got a new bound, and 195 00:12:19,980 --> 00:12:22,700 the new bound was always better than the old bound. 196 00:12:22,700 --> 00:12:25,810 >> The variational bounds, showing as you add layers. 197 00:12:25,810 --> 00:12:26,970 Yes, I remember that video. 198 00:12:26,970 --> 00:12:29,680 >> So that was the second thing that I was really excited about. 199 00:12:29,680 --> 00:12:35,600 And I guess the third thing was the work I did on variational methods. 200 00:12:35,600 --> 00:12:40,750 It turns out people in statistics had done similar work earlier, 201 00:12:40,750 --> 00:12:43,100 but we didn't know about that. 202 00:12:44,610 --> 00:12:47,260 So we managed to make 203 00:12:47,260 --> 00:12:50,250 EM work a whole lot better by showing you didn't need to do a perfect E step. 204 00:12:50,250 --> 00:12:52,800 You could do an approximate E step. 205 00:12:52,800 --> 00:12:55,320 And EM was a big algorithm in statistics. 206 00:12:55,320 --> 00:12:58,380 And we'd showed a big generalization of it. 207 00:12:58,380 --> 00:13:02,490 And in particular, in 1993, I guess, with Van Camp, 208 00:13:02,490 --> 00:13:07,040 I did a paper that was, I think, the first variational Bayes paper, 209 00:13:07,040 --> 00:13:12,090 where we showed that you could actually do a version of Bayesian learning 210 00:13:12,090 --> 00:13:17,950 that was far more tractable, by approximating the true posterior with a simpler distribution. 211 00:13:17,950 --> 00:13:20,320 And you could do that in a neural net. 212 00:13:20,320 --> 00:13:22,600 And I was very excited by that. 213 00:13:22,600 --> 00:13:23,680 >> I see. Wow, right. 214 00:13:23,680 --> 00:13:26,670 Yep, I think I remember all of these papers. 215 00:13:26,670 --> 00:13:32,630 That variational approximation paper, I spent many hours reading over that. 216 00:13:32,630 --> 00:13:36,070 And I think some of the algorithms you use today, or 217 00:13:36,070 --> 00:13:41,110 some of the algorithms that lots of people use almost every day, are what, 218 00:13:41,110 --> 00:13:46,570 things like dropout, or I guess ReLU activations, came from your group? 219 00:13:46,570 --> 00:13:47,390 >> Yes and no. 220 00:13:47,390 --> 00:13:51,470 So other people have thought about rectified linear units. 221 00:13:51,470 --> 00:13:56,860 And we actually did some work with restricted Boltzmann machines showing 222 00:13:56,860 --> 00:14:02,880 that a ReLU was almost exactly equivalent to a whole stack of logistic units. 223 00:14:02,880 --> 00:14:05,190 And that's one of the things that helped ReLUs catch on. 224 00:14:05,190 --> 00:14:07,440 >> I was really curious about that. 225 00:14:07,440 --> 00:14:12,570 The ReLU paper had a lot of math showing that this function 226 00:14:12,570 --> 00:14:15,530 can be approximated with this really complicated formula.
227 00:14:15,530 --> 00:14:19,140 Did you do that math so your paper would get accepted into an academic conference, 228 00:14:19,140 --> 00:14:24,840 or did all that math really influence the development of max of 0 and x? 229 00:14:26,450 --> 00:14:30,440 >> That was one of the cases where actually the math was important 230 00:14:30,440 --> 00:14:32,350 to the development of the idea. 231 00:14:32,350 --> 00:14:35,262 So I knew about rectified linear units, obviously, and 232 00:14:35,262 --> 00:14:36,821 I knew about logistic units. 233 00:14:36,821 --> 00:14:39,250 And because of the work on Boltzmann machines, 234 00:14:39,250 --> 00:14:42,720 all of the basic work was done using logistic units. 235 00:14:42,720 --> 00:14:45,120 And so the question was, 236 00:14:45,120 --> 00:14:49,070 could the learning algorithm work in something with rectified linear units? 237 00:14:49,070 --> 00:14:54,400 And by showing the rectified linear units were almost exactly equivalent to a stack 238 00:14:54,400 --> 00:15:00,350 of logistic units, we showed that all the math would go through. 239 00:15:00,350 --> 00:15:01,508 >> I see. 240 00:15:01,508 --> 00:15:05,890 And it provided the inspiration; today, tons of people use ReLUs and 241 00:15:05,890 --> 00:15:08,000 it just works without- >> Yeah. 242 00:15:08,000 --> 00:15:12,130 >> Without necessarily needing to understand the same motivation. 243 00:15:13,150 --> 00:15:16,850 >> Yeah, one thing I noticed later when I went to Google. 244 00:15:16,850 --> 00:15:22,796 I guess in 2014, I gave a talk at Google about using ReLUs and 245 00:15:22,796 --> 00:15:26,660 initializing with the identity matrix. 246 00:15:26,660 --> 00:15:30,300 Because the nice thing about ReLUs is that if you keep replicating the hidden 247 00:15:30,300 --> 00:15:32,667 layers and you initialize with the identity, 248 00:15:32,667 --> 00:15:35,050 it just copies the pattern in the layer below. 249 00:15:36,140 --> 00:15:40,120 And so I was showing that you could train networks with 300 hidden layers and 250 00:15:40,120 --> 00:15:44,760 you could train them really efficiently if you initialize with the identity. 251 00:15:44,760 --> 00:15:48,065 But I didn't pursue that any further and I really regret not pursuing that. 252 00:15:48,065 --> 00:15:52,507 We published one paper showing you could initialize 253 00:15:52,507 --> 00:15:55,565 recurrent nets like that. 254 00:15:55,565 --> 00:16:00,370 But I should have pursued it further because later on these residual 255 00:16:00,370 --> 00:16:03,572 networks are really that kind of thing. 256 00:16:03,572 --> 00:16:06,660 >> Over the years I've heard you talk a lot about the brain. 257 00:16:06,660 --> 00:16:09,447 I've heard you talk about the relationship between backprop and the brain. 258 00:16:09,447 --> 00:16:13,720 What are your current thoughts on that? 259 00:16:13,720 --> 00:16:16,910 >> I'm actually working on a paper on that right now. 260 00:16:18,250 --> 00:16:21,160 I guess my main thought is this. 261 00:16:21,160 --> 00:16:25,570 If it turns out that backprop is a really good algorithm for doing learning, 262 00:16:26,620 --> 00:16:31,610 then for sure evolution could've figured out how to implement it. 263 00:16:32,730 --> 00:16:37,270 I mean you have cells that could turn into either eyeballs or teeth.
264 00:16:37,270 --> 00:16:42,440 Now, if cells can do that, they can for sure implement backpropagation and 265 00:16:42,440 --> 00:16:45,860 presumably there's huge selective pressure for it. 266 00:16:45,860 --> 00:16:50,490 So I think the neuroscientists' idea that it doesn't look plausible is just silly. 267 00:16:50,490 --> 00:16:52,890 There may be some subtle implementation of it. 268 00:16:52,890 --> 00:16:56,000 And I think the brain probably has something that may not be exactly 269 00:16:56,000 --> 00:16:58,620 backpropagation, but it's quite close to it. 270 00:16:58,620 --> 00:17:02,566 And over the years, I've come up with a number of ideas about how this might work. 271 00:17:02,566 --> 00:17:06,994 So in 1987, working with Jay McClelland, 272 00:17:06,994 --> 00:17:11,202 I came up with the recirculation algorithm, 273 00:17:11,202 --> 00:17:16,090 where the idea is you send information round a loop. 274 00:17:17,470 --> 00:17:18,686 And you try to make it so 275 00:17:18,686 --> 00:17:22,206 that things don't change as information goes around this loop. 276 00:17:22,206 --> 00:17:26,490 So the simplest version would be you have input units and hidden units, and 277 00:17:26,490 --> 00:17:31,046 you send information from the input to the hidden and then back to the input, and 278 00:17:31,046 --> 00:17:34,388 then back to the hidden and then back to the input and so on. 279 00:17:34,388 --> 00:17:38,001 And what you want, you want to train an autoencoder, 280 00:17:38,001 --> 00:17:42,300 but you want to train it without having to do backpropagation. 281 00:17:42,300 --> 00:17:47,250 So you just train it to try and get rid of all variation in the activities. 282 00:17:47,250 --> 00:17:51,922 So the idea is that the learning rule for 283 00:17:51,922 --> 00:17:57,930 a synapse is: change the weight in proportion to the presynaptic input and 284 00:17:57,930 --> 00:18:01,780 in proportion to the rate of change of the postsynaptic input. 285 00:18:01,780 --> 00:18:04,060 But in recirculation, for the postsynaptic input, 286 00:18:04,060 --> 00:18:08,330 you're trying to make the old one be good and the new one be bad, so 287 00:18:08,330 --> 00:18:09,620 you're changing in that direction. 288 00:18:11,010 --> 00:18:14,472 We invented this algorithm before neuroscientists came up with 289 00:18:14,472 --> 00:18:16,521 spike-timing-dependent plasticity. 290 00:18:16,521 --> 00:18:20,700 Spike-timing-dependent plasticity is actually the same algorithm but the other 291 00:18:20,700 --> 00:18:26,220 way round, where the new thing is good and the old thing is bad in the learning rule. 292 00:18:26,220 --> 00:18:30,010 So you're changing the weight in proportion to the presynaptic activity 293 00:18:30,010 --> 00:18:35,690 times the new postsynaptic activity minus the old one. 294 00:18:37,060 --> 00:18:42,020 Later on, in 2007, I realized that if you took a stack of 295 00:18:42,020 --> 00:18:47,830 restricted Boltzmann machines and you trained it up, 296 00:18:47,830 --> 00:18:52,620 then after it was trained, you had exactly the right conditions for 297 00:18:52,620 --> 00:18:56,450 implementing backpropagation by just trying to reconstruct. 298 00:18:56,450 --> 00:19:01,124 If you looked at the reconstruction error, that reconstruction error would 299 00:19:01,124 --> 00:19:05,728 actually tell you the derivative of the discriminative performance. 300 00:19:05,728 --> 00:19:12,079 And at the first deep learning workshop in 2007, I gave a talk about that.
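[Aside for readers of this transcript: below is a rough Python/NumPy sketch of the recirculation-style weight update Hinton describes above, where activity goes input to hidden, back to input, back to hidden, and each weight changes in proportion to its presynaptic activity times old-minus-new postsynaptic activity, with no separate backward pass of error signals. The activation function, learning rate, and toy data are assumptions, and the published Hinton-McClelland version also averages old and new visible states, which is omitted here.]

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 8, 3
W_vh = rng.normal(0, 0.1, (n_vis, n_hid))   # visible -> hidden weights
W_hv = rng.normal(0, 0.1, (n_hid, n_vis))   # hidden -> visible weights
lr = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

data = rng.integers(0, 2, size=(20, n_vis)).astype(float)   # toy binary inputs

for epoch in range(200):
    for v0 in data:
        h0 = sigmoid(v0 @ W_vh)          # first pass: input -> hidden
        v1 = sigmoid(h0 @ W_hv)          # reconstruction: hidden -> input
        h1 = sigmoid(v1 @ W_vh)          # second pass: reconstruction -> hidden
        # "old one good, new one bad": presynaptic * (old - new) postsynaptic
        W_hv += lr * np.outer(h0, v0 - v1)
        W_vh += lr * np.outer(v1, h0 - h1)

# If learning works, the reconstruction drifts toward the input, i.e. the
# recirculating activity stops changing as it goes around the loop.
recon = sigmoid(sigmoid(data @ W_vh) @ W_hv)
print("mean squared reconstruction error:", np.mean((data - recon) ** 2))
```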
301 00:19:12,079 --> 00:19:16,454 That was almost completely ignored. 302 00:19:16,454 --> 00:19:19,799 Later on, Yoshua Bengio took up the idea and 303 00:19:19,799 --> 00:19:24,340 has actually done quite a lot more work on that. 304 00:19:24,340 --> 00:19:26,490 And I've been doing more work on it myself. 305 00:19:26,490 --> 00:19:33,280 And I think this idea that if you have a stack of autoencoders, then you can 306 00:19:33,280 --> 00:19:38,440 get derivatives by sending activity backwards and looking at the reconstruction errors, 307 00:19:38,440 --> 00:19:42,520 is a really interesting idea and may well be how the brain does it. 308 00:19:42,520 --> 00:19:47,520 >> One other topic that I know you've thought about and that I hear you're still 309 00:19:47,520 --> 00:19:51,930 working on is how to deal with multiple time scales in deep learning. 310 00:19:51,930 --> 00:19:54,468 So, can you share your thoughts on that? 311 00:19:54,468 --> 00:19:58,910 >> Yes, so actually, that goes back to my first years as a graduate student. 312 00:19:58,910 --> 00:20:04,040 The first talk I ever gave was about using what I called fast weights. 313 00:20:04,040 --> 00:20:07,560 So weights that adapt rapidly, but decay rapidly. 314 00:20:07,560 --> 00:20:08,832 And therefore can hold short-term memory. 315 00:20:08,832 --> 00:20:13,496 And I showed in a very simple system in 1973 that you could do 316 00:20:13,496 --> 00:20:16,590 true recursion with those weights. 317 00:20:16,590 --> 00:20:23,010 And what I mean by true recursion is that the neurons that are used 318 00:20:23,010 --> 00:20:28,470 in representing things get re-used for representing things in the recursive call. 319 00:20:30,210 --> 00:20:31,750 And the weights that are used for 320 00:20:31,750 --> 00:20:34,388 the actual knowledge get re-used in the recursive call. 321 00:20:34,388 --> 00:20:39,170 And so that leads to the question of, when you pop out of the recursive call, 322 00:20:39,170 --> 00:20:41,600 how do you remember what it was you were in the middle of doing? 323 00:20:41,600 --> 00:20:42,970 Where's that memory? 324 00:20:42,970 --> 00:20:45,015 Because you used the neurons for the recursive call. 325 00:20:46,080 --> 00:20:49,240 And the answer is you can put that memory into fast weights, and 326 00:20:49,240 --> 00:20:53,940 you can recover the activities of the neurons from those fast weights. 327 00:20:53,940 --> 00:20:56,151 And more recently, working with Jimmy Ba, 328 00:20:56,151 --> 00:21:00,141 we actually got a paper in by using fast weights for recursion like that. 329 00:21:00,141 --> 00:21:00,898 >> I see. 330 00:21:00,898 --> 00:21:04,145 >> So that was quite a big gap. 331 00:21:04,145 --> 00:21:08,746 The first model was unpublished, in 1973, and 332 00:21:08,746 --> 00:21:14,966 then Jimmy Ba's model was in 2015, I think, or 2016. 333 00:21:14,966 --> 00:21:16,469 So it's about 40 years later. 334 00:21:16,469 --> 00:21:22,840 >> And, I guess, one other idea you've had for quite a few years now, 335 00:21:22,840 --> 00:21:29,350 over five years, I think, is capsules, where are you with that? 336 00:21:29,350 --> 00:21:34,150 >> Okay, so I'm back to the state I'm used to being in. 337 00:21:34,150 --> 00:21:39,320 Which is I have this idea I really believe in and nobody else believes it. 338 00:21:39,320 --> 00:21:42,120 And I submit papers about it and they would get rejected. 339 00:21:42,120 --> 00:21:45,938 But I really believe in this idea and I'm just going to keep pushing it.
340 00:21:45,938 --> 00:21:53,880 So it hinges on, there's a couple of key ideas. 341 00:21:53,880 --> 00:22:00,000 One is about how you represent multi-dimensional entities, and you 342 00:22:00,000 --> 00:22:05,070 can represent multi-dimensional entities by just a little vector of activities. 343 00:22:05,070 --> 00:22:07,630 As long as you know there's only one of them. 344 00:22:07,630 --> 00:22:12,150 So the idea is in each region of the image, you'll assume there's at most 345 00:22:12,150 --> 00:22:14,000 one of a particular kind of feature. 346 00:22:15,200 --> 00:22:18,020 And then you'll use a bunch of neurons, and 347 00:22:18,020 --> 00:22:23,190 their activities will represent the different aspects of that feature, 348 00:22:24,230 --> 00:22:27,270 like within that region exactly what are its x and y coordinates? 349 00:22:27,270 --> 00:22:28,780 What orientation is it at? 350 00:22:28,780 --> 00:22:29,930 How fast is it moving? 351 00:22:29,930 --> 00:22:30,630 What color is it? 352 00:22:30,630 --> 00:22:31,270 How bright is it? 353 00:22:31,270 --> 00:22:32,590 And stuff like that. 354 00:22:32,590 --> 00:22:36,350 So you can use a whole bunch of neurons to represent different dimensions of 355 00:22:36,350 --> 00:22:37,710 the same thing. 356 00:22:37,710 --> 00:22:39,410 Provided there's only one of them. 357 00:22:40,490 --> 00:22:46,110 That's a very different way of doing representation 358 00:22:46,110 --> 00:22:48,155 from what we're normally used to in neural nets. 359 00:22:48,155 --> 00:22:49,820 Normally in neural nets, we just have a great big layer, 360 00:22:49,820 --> 00:22:52,080 and all the units go off and do whatever they do. 361 00:22:52,080 --> 00:22:55,770 But you don't think of bundling them up into little groups that represent 362 00:22:55,770 --> 00:22:57,310 different coordinates of the same thing. 363 00:22:58,660 --> 00:23:02,080 So I think we should be using this extra structure. 364 00:23:02,080 --> 00:23:05,020 And then the other idea that goes with that. 365 00:23:05,020 --> 00:23:07,410 >> So this means, in terms of the representation, 366 00:23:07,410 --> 00:23:09,280 you partition the representation. 367 00:23:09,280 --> 00:23:11,270 >> Yes. >> Into different subsets. 368 00:23:11,270 --> 00:23:13,900 >> Yes. >> To represent, right, rather than- 369 00:23:13,900 --> 00:23:15,600 >> I call each of those subsets a capsule. 370 00:23:15,600 --> 00:23:16,180 >> I see. 371 00:23:16,180 --> 00:23:21,078 >> And the idea is a capsule is able to represent an instance of a feature, but 372 00:23:21,078 --> 00:23:21,794 only one. 373 00:23:21,794 --> 00:23:27,130 And it represents all the different properties of that feature. 374 00:23:27,130 --> 00:23:29,880 It's a feature that has a lot of properties as opposed to 375 00:23:29,880 --> 00:23:34,530 a normal neuron in a normal neural net, which has just one scalar property. 376 00:23:34,530 --> 00:23:36,240 >> Yeah, I see, yep. 377 00:23:36,240 --> 00:23:41,423 >> And then what you can do if you've got that, is you can do something that normal 378 00:23:41,423 --> 00:23:48,980 neural nets are very bad at, which is you can do what I call routing by agreement. 379 00:23:48,980 --> 00:23:52,960 So let's suppose you want to do segmentation and 380 00:23:52,960 --> 00:23:56,660 you have something that might be a mouth and something else that might be a nose. 381 00:23:57,910 --> 00:24:02,179 And you want to know if you should put them together to make one thing.
382 00:24:02,179 --> 00:24:03,879 So the idea is you should have a capsule for 383 00:24:03,879 --> 00:24:06,040 a mouth that has the parameters of the mouth. 384 00:24:06,040 --> 00:24:10,582 And you have a capsule for a nose that has the parameters of the nose. 385 00:24:10,582 --> 00:24:13,797 And then to decide whether to put them together or 386 00:24:13,797 --> 00:24:18,670 not, you get each of them to vote for what the parameters should be for a face. 387 00:24:19,930 --> 00:24:23,718 Now if the mouth and the nose are in the right spatial relationship, 388 00:24:23,718 --> 00:24:24,725 they will agree. 389 00:24:24,725 --> 00:24:28,888 So when you get two capsules at one level voting for the same set of parameters at 390 00:24:28,888 --> 00:24:32,106 the next level up, you can assume they're probably right, 391 00:24:32,106 --> 00:24:35,350 because agreement in a high-dimensional space is very unlikely. 392 00:24:36,950 --> 00:24:42,109 And that's a very different way of doing filtering, 393 00:24:42,109 --> 00:24:46,130 than what we normally use in neural nets. 394 00:24:46,130 --> 00:24:50,708 So I think this routing by agreement is going to be crucial for 395 00:24:50,708 --> 00:24:56,700 getting neural nets to generalize much better from limited data. 396 00:24:56,700 --> 00:24:59,797 I think it'd be very good at dealing with changes in viewpoint, 397 00:24:59,797 --> 00:25:01,500 very good at doing segmentation. 398 00:25:01,500 --> 00:25:04,794 And I'm hoping it will be much more statistically efficient than what we 399 00:25:04,794 --> 00:25:06,147 currently do in neural nets. 400 00:25:06,147 --> 00:25:08,575 Which is, if you want to deal with changes in viewpoint, 401 00:25:08,575 --> 00:25:12,000 you just give it a whole bunch of changes in viewpoint and train on them all. 402 00:25:12,000 --> 00:25:16,460 >> I see, right, so rather than feedforward, supervised learning, 403 00:25:16,460 --> 00:25:19,120 you can learn this in some different way. 404 00:25:20,220 --> 00:25:24,120 >> Well, I still plan to do it with supervised learning, but 405 00:25:24,120 --> 00:25:27,720 the mechanics of the forward pass are very different. 406 00:25:27,720 --> 00:25:32,010 It's not a pure forward pass in the sense that there's little bits of iteration 407 00:25:32,010 --> 00:25:36,550 going on, where you think you found a mouth and you think you found a nose. 408 00:25:36,550 --> 00:25:39,127 And you use a little bit of iteration to decide 409 00:25:39,127 --> 00:25:42,530 whether they should really go together to make a face. 410 00:25:42,530 --> 00:25:46,352 And you can do backprop through that iteration. 411 00:25:46,352 --> 00:25:50,286 So you can try and do it a little discriminatively, 412 00:25:50,286 --> 00:25:54,417 and we're working on that now at my group in Toronto. 413 00:25:54,417 --> 00:26:00,260 So I now have a little Google team in Toronto, part of the Brain team. 414 00:26:00,260 --> 00:26:02,127 That's what I'm excited about right now. 415 00:26:02,127 --> 00:26:02,891 >> I see, great, yeah. 416 00:26:02,891 --> 00:26:05,366 Look forward to that paper when that comes out. 417 00:26:05,366 --> 00:26:10,750 >> Yeah, if it comes out [LAUGH]. 418 00:26:10,750 --> 00:26:13,040 >> You've worked in deep learning for several decades. 419 00:26:13,040 --> 00:26:15,330 I'm actually really curious, how has your thinking, 420 00:26:15,330 --> 00:26:18,760 your understanding of AI changed over these years?
421 00:26:20,380 --> 00:26:27,678 >> So I guess a lot of my intellectual history has been around back propagation, 422 00:26:27,678 --> 00:26:33,531 and how to use back propagation, how to make use of its power. 423 00:26:33,531 --> 00:26:36,966 So to begin with, in the mid 80s, we were using it for 424 00:26:36,966 --> 00:26:40,203 discriminative learning and it was working well. 425 00:26:40,203 --> 00:26:42,405 I then decided, by the early 90s, 426 00:26:42,405 --> 00:26:46,749 that actually most human learning was going to be unsupervised learning. 427 00:26:46,749 --> 00:26:50,138 And I got much more interested in unsupervised learning, and 428 00:26:50,138 --> 00:26:54,300 that's when I worked on things like the wake-sleep algorithm. 429 00:26:54,300 --> 00:26:58,306 >> And your comments at that time really influenced my thinking as well. 430 00:26:58,306 --> 00:27:03,010 So when I was leading Google Brain, our first project put a lot of 431 00:27:03,010 --> 00:27:07,900 work into unsupervised learning because of your influence. 432 00:27:07,900 --> 00:27:09,740 >> Right, and I may have misled you. 433 00:27:09,740 --> 00:27:11,470 Because in the long run, 434 00:27:11,470 --> 00:27:13,840 I think unsupervised learning is going to be absolutely crucial. 435 00:27:15,160 --> 00:27:19,376 But you have to sort of face reality. 436 00:27:19,376 --> 00:27:24,107 And what's worked over the last ten years or so is supervised learning. 437 00:27:24,107 --> 00:27:27,179 Discriminative training, where you have labels, or 438 00:27:27,179 --> 00:27:31,810 you're trying to predict the next thing in the series, so that acts as the label. 439 00:27:31,810 --> 00:27:33,769 And that's worked incredibly well. 440 00:27:37,528 --> 00:27:42,266 I still believe that unsupervised learning is going to be crucial, and things will 441 00:27:42,266 --> 00:27:47,145 work incredibly much better than they do now when we get that working properly, but 442 00:27:47,145 --> 00:27:48,200 we haven't yet. 443 00:27:49,990 --> 00:27:53,225 >> Yeah, I think many of the senior people in deep learning, 444 00:27:53,225 --> 00:27:56,074 including myself, remain very excited about it. 445 00:27:56,074 --> 00:28:01,513 It's just none of us really have almost any idea how to do it yet. 446 00:28:01,513 --> 00:28:04,983 Maybe you do, I don't feel like I do. 447 00:28:04,983 --> 00:28:08,160 >> Variational autoencoders, where you use the reparameterization trick, 448 00:28:08,160 --> 00:28:10,120 seemed to me like a really nice idea. 449 00:28:10,120 --> 00:28:15,260 And generative adversarial nets also seemed to me to be a really nice idea. 450 00:28:15,260 --> 00:28:18,645 I think generative adversarial nets are one of 451 00:28:18,645 --> 00:28:23,430 the sort of biggest ideas in deep learning that's really new. 452 00:28:23,430 --> 00:28:26,363 I'm hoping I can make capsules that successful, but 453 00:28:26,363 --> 00:28:31,740 right now generative adversarial nets, I think, have been a big breakthrough. 454 00:28:31,740 --> 00:28:34,439 >> What happened to sparsity and slow features, 455 00:28:34,439 --> 00:28:38,806 which were two of the other principles for building unsupervised models? 456 00:28:41,556 --> 00:28:47,788 >> I was never as big on sparsity as you were, buddy. 457 00:28:47,788 --> 00:28:52,672 But slow features, I think, is a mistake. 458 00:28:52,672 --> 00:28:53,660 You shouldn't say slow.
459 00:28:53,660 --> 00:28:57,880 The basic idea is right, but you shouldn't go for features that don't change, 460 00:28:57,880 --> 00:29:00,660 you should go for features that change in predictable ways. 461 00:29:01,680 --> 00:29:07,060 So here's a sort of basic principle about how you model anything. 462 00:29:08,620 --> 00:29:13,391 You take your measurements, and you're applying nonlinear 463 00:29:13,391 --> 00:29:17,612 transformations to your measurements until you get to 464 00:29:17,612 --> 00:29:22,672 a representation as a state vector in which the action is linear. 465 00:29:22,672 --> 00:29:26,103 So you don't just pretend it's linear like you do with Kalman filters. 466 00:29:26,103 --> 00:29:29,625 But you actually find a transformation from the observables to 467 00:29:29,625 --> 00:29:32,616 the underlying variables where linear operations, 468 00:29:32,616 --> 00:29:37,480 like matrix multiplies on the underlying variables, will do the work. 469 00:29:37,480 --> 00:29:39,700 So for example, if you want to change viewpoints. 470 00:29:39,700 --> 00:29:42,890 If you want to produce the image from another viewpoint, 471 00:29:42,890 --> 00:29:46,900 what you should do is go from the pixels to coordinates. 472 00:29:47,950 --> 00:29:50,686 And once you've got to the coordinate representation, 473 00:29:50,686 --> 00:29:54,120 which is the kind of thing I'm hoping capsules will find, 474 00:29:54,120 --> 00:29:57,350 you can then do a matrix multiply to change viewpoint, and 475 00:29:57,350 --> 00:29:59,210 then you can map it back to pixels. 476 00:29:59,210 --> 00:29:59,893 >> Right, that's why you did all that. 477 00:29:59,893 --> 00:30:02,170 >> I think that's a very, very general principle. 478 00:30:02,170 --> 00:30:04,773 >> That's why you did all that work on face synthesis, right? 479 00:30:04,773 --> 00:30:09,355 Where you take a face and compress it to a very low-dimensional vector, and so 480 00:30:09,355 --> 00:30:12,450 you can fiddle with that and get back other faces. 481 00:30:12,450 --> 00:30:15,950 >> I had a student who worked on that, I didn't do much work on that myself. 482 00:30:17,100 --> 00:30:19,180 >> Now I'm sure you still get asked all the time, 483 00:30:19,180 --> 00:30:23,920 if someone wants to break into deep learning, what should they do? 484 00:30:23,920 --> 00:30:25,040 So what advice would you have? 485 00:30:25,040 --> 00:30:28,938 I'm sure you've given a lot of advice to people in one-on-one settings, but for 486 00:30:28,938 --> 00:30:31,550 the global audience of people watching this video. 487 00:30:31,550 --> 00:30:35,999 What advice would you have for them to get into deep learning? 488 00:30:35,999 --> 00:30:42,171 >> Okay, so my advice is sort of read the literature, but don't read too much of it. 489 00:30:42,171 --> 00:30:48,030 So this is advice I got from my advisor, which is very unlike what most people say. 490 00:30:48,030 --> 00:30:52,474 Most people say you should spend several years reading the literature and 491 00:30:52,474 --> 00:30:55,421 then you should start working on your own ideas. 492 00:30:55,421 --> 00:31:00,295 And that may be true for some researchers, but for creative researchers I think 493 00:31:00,295 --> 00:31:03,803 what you want to do is read a little bit of the literature. 494 00:31:03,803 --> 00:31:07,792 And notice something that you think everybody is doing wrong, 495 00:31:07,792 --> 00:31:10,340 I'm contrary in that sense.
496 00:31:10,340 --> 00:31:13,568 You look at it and it just doesn't feel right. 497 00:31:13,568 --> 00:31:15,660 And then figure out how to do it right. 498 00:31:16,890 --> 00:31:22,476 And then when people tell you, that's no good, just keep at it. 499 00:31:22,476 --> 00:31:26,339 And I have a very good principle for helping people keep at it, 500 00:31:26,339 --> 00:31:29,996 which is either your intuitions are good or they're not. 501 00:31:29,996 --> 00:31:32,030 If your intuitions are good, you should follow them and 502 00:31:32,030 --> 00:31:34,060 you'll eventually be successful. 503 00:31:34,060 --> 00:31:36,478 If your intuitions are not good, it doesn't matter what you do. 504 00:31:36,478 --> 00:31:40,329 >> I see [LAUGH]. 505 00:31:40,329 --> 00:31:43,420 Inspiring advice, might as well go for it. 506 00:31:43,420 --> 00:31:45,410 >> You might as well trust your intuitions. 507 00:31:45,410 --> 00:31:47,847 There's no point not trusting them. 508 00:31:47,847 --> 00:31:49,420 >> I see, yeah. 509 00:31:49,420 --> 00:31:55,193 I usually advise people to not just read, but replicate published papers. 510 00:31:55,193 --> 00:31:58,161 And maybe that puts a natural limiter on how many you could do, 511 00:31:58,161 --> 00:32:00,800 because replicating results is pretty time consuming. 512 00:32:01,910 --> 00:32:05,312 >> Yes, it's true that when you're trying to replicate a published paper, 513 00:32:05,312 --> 00:32:08,100 you discover all the little tricks necessary to make it work. 514 00:32:08,100 --> 00:32:11,938 The other advice I have is, never stop programming. 515 00:32:11,938 --> 00:32:15,577 Because if you give a student something to do, if they're botching it, 516 00:32:15,577 --> 00:32:18,550 they'll come back and say, it didn't work. 517 00:32:18,550 --> 00:32:22,030 And the reason it didn't work would be some little decision they made, 518 00:32:22,030 --> 00:32:25,100 that they didn't realize is crucial. 519 00:32:25,100 --> 00:32:28,850 And if you give it to a good student, for example, 520 00:32:28,850 --> 00:32:31,120 you can give him anything and he'll come back and say, it worked. 521 00:32:32,670 --> 00:32:36,420 I remember doing this once, and I said, but wait a minute. 522 00:32:36,420 --> 00:32:37,330 Since we last talked, 523 00:32:37,330 --> 00:32:40,380 I realized it couldn't possibly work for the following reason. 524 00:32:40,380 --> 00:32:43,586 And he said, yeah, I realized that right away, so I assumed you didn't mean that. 525 00:32:43,586 --> 00:32:47,627 >> [LAUGH] I see, yeah, that's great, yeah. 526 00:32:47,627 --> 00:32:51,575 Let's see, any other advice for 527 00:32:51,575 --> 00:32:57,782 people that want to break into AI and deep learning? 528 00:32:57,782 --> 00:33:02,000 >> I think that's basically it: read enough so you start developing intuitions. 529 00:33:02,000 --> 00:33:05,811 And then, trust your intuitions and go for it, 530 00:33:05,811 --> 00:33:10,783 don't be too worried if everybody else says it's nonsense. 531 00:33:10,783 --> 00:33:14,352 >> And I guess there's no way to know if others are right or 532 00:33:14,352 --> 00:33:19,950 wrong when they say it's nonsense, but you just have to go for it, and then find out. 533 00:33:19,950 --> 00:33:24,350 >> Right, but there is one thing, which is, if you think it's a really good idea, 534 00:33:24,350 --> 00:33:27,201 and other people tell you it's complete nonsense, 535 00:33:27,201 --> 00:33:29,761 then you know you're really on to something.
536 00:33:29,761 --> 00:33:33,960 So one example of that is when I first came up with variational methods. 537 00:33:35,420 --> 00:33:40,690 I sent mail explaining it to a former student of mine called Peter Brown, 538 00:33:40,690 --> 00:33:42,560 who knew a lot about that sort of thing. 539 00:33:43,570 --> 00:33:46,967 And he showed it to people who worked with him, 540 00:33:46,967 --> 00:33:51,253 a pair of brothers, they were twins, I think. 541 00:33:51,253 --> 00:33:55,914 And he then told me later what they said, and they said, 542 00:33:55,914 --> 00:34:00,277 either this guy's drunk, or he's just stupid, so 543 00:34:00,277 --> 00:34:04,260 they really, really thought it was nonsense. 544 00:34:04,260 --> 00:34:06,460 Now, it could have been partly the way I explained it, 545 00:34:06,460 --> 00:34:08,043 because I explained it in intuitive terms. 546 00:34:09,150 --> 00:34:13,100 But when you have what you think is a good idea and 547 00:34:13,100 --> 00:34:16,810 other people think is complete rubbish, that's the sign of a really good idea. 548 00:34:18,026 --> 00:34:21,555 >> I see, and research topics, 549 00:34:21,555 --> 00:34:26,183 new grad students should work on capsules and 550 00:34:26,183 --> 00:34:30,707 maybe unsupervised learning, any others? 551 00:34:30,707 --> 00:34:34,078 >> One good piece of advice for new grad students is, 552 00:34:34,078 --> 00:34:38,344 see if you can find an advisor who has beliefs similar to yours. 553 00:34:38,344 --> 00:34:42,637 Because if you work on stuff that your advisor feels deeply about, 554 00:34:42,637 --> 00:34:47,170 you'll get a lot of good advice and time from your advisor. 555 00:34:47,170 --> 00:34:50,590 If you work on stuff your advisor's not interested in, 556 00:34:50,590 --> 00:34:55,262 all you'll get is some advice, but it won't be nearly so useful. 557 00:34:55,262 --> 00:34:58,386 >> I see, and last one on advice for learners, 558 00:34:58,386 --> 00:35:02,440 how do you feel about people entering a PhD program 559 00:35:02,440 --> 00:35:09,687 versus joining a top company, or a top research group? 560 00:35:09,687 --> 00:35:13,890 >> Yeah, it's complicated, I think right now, what's happening is, 561 00:35:13,890 --> 00:35:18,727 there aren't enough academics trained in deep learning to educate all the people 562 00:35:18,727 --> 00:35:21,125 that we need educated in universities. 563 00:35:21,125 --> 00:35:25,011 There just isn't the faculty bandwidth there, but 564 00:35:25,011 --> 00:35:27,780 I think that's going to be temporary. 565 00:35:27,780 --> 00:35:32,410 I think what's happened is, most departments have been very slow to 566 00:35:32,410 --> 00:35:34,890 understand the kind of revolution that's going on. 567 00:35:34,890 --> 00:35:38,720 I kind of agree with you, that it's not quite a second industrial revolution, but 568 00:35:38,720 --> 00:35:41,000 it's something on nearly that scale. 569 00:35:41,000 --> 00:35:43,691 And there's a huge sea change going on, 570 00:35:43,691 --> 00:35:47,980 basically because our relationship to computers has changed. 571 00:35:47,980 --> 00:35:53,920 Instead of programming them, we now show them, and they figure it out. 572 00:35:53,920 --> 00:35:56,570 That's a completely different way of using computers, and 573 00:35:56,570 --> 00:36:01,210 computer science departments are built around the idea of programming computers.
574 00:36:01,210 --> 00:36:03,480 And they don't understand that sort of, 575 00:36:05,000 --> 00:36:09,330 this showing computers is going to be as big as programming computers. 576 00:36:09,330 --> 00:36:13,940 They don't understand that half the people in the department should be people 577 00:36:13,940 --> 00:36:16,510 who get computers to do things by showing them. 578 00:36:16,510 --> 00:36:22,183 So my department refuses to acknowledge that it should have lots and 579 00:36:22,183 --> 00:36:24,790 lots of people doing this. 580 00:36:24,790 --> 00:36:28,730 They think they've got a couple, maybe a few more, but not too many. 581 00:36:31,260 --> 00:36:32,452 And in that situation, 582 00:36:32,452 --> 00:36:36,510 you have to rely on the big companies to do quite a lot of the training. 583 00:36:36,510 --> 00:36:40,335 So Google is now training people, we call them Brain Residents, and 584 00:36:40,335 --> 00:36:43,792 I suspect the universities will eventually catch up. 585 00:36:43,792 --> 00:36:48,360 >> I see, right, in fact, maybe a lot of students have figured this out. 586 00:36:48,360 --> 00:36:53,131 In a lot of top 50 programs, over half of the applicants are actually 587 00:36:53,131 --> 00:36:57,079 wanting to work on showing, rather than programming. 588 00:36:57,079 --> 00:37:00,720 Yeah, cool, yeah, in fact, to give credit where it's due, 589 00:37:00,720 --> 00:37:04,930 whereas deeplearning.ai is creating a deep learning specialization, 590 00:37:04,930 --> 00:37:09,239 as far as I know, the first deep learning MOOC was actually yours, taught 591 00:37:09,239 --> 00:37:11,752 on Coursera, back in 2012, as well. 592 00:37:12,828 --> 00:37:14,430 And somewhat strangely, 593 00:37:14,430 --> 00:37:18,900 that's when you first published the RMSprop algorithm, which also is a- 594 00:37:20,240 --> 00:37:25,910 >> Right, yes, well, as you know, that was because you invited me to do the MOOC. 595 00:37:25,910 --> 00:37:30,239 And then when I was very dubious about doing it, you kept pushing me to do it, so 596 00:37:30,239 --> 00:37:34,340 it was very good that I did, although it was a lot of work. 597 00:37:34,340 --> 00:37:37,409 >> Yes, and thank you for doing that, I remember you complaining to me 598 00:37:37,409 --> 00:37:38,351 about how much work it was. 599 00:37:38,351 --> 00:37:42,413 And you staying up late at night, but I think many, many learners have 600 00:37:42,413 --> 00:37:47,330 benefited from your first MOOC, so I'm very grateful to you for it. 601 00:37:47,330 --> 00:37:49,260 >> That's good, yeah. >> Yeah, over the years, 602 00:37:49,260 --> 00:37:53,290 I've seen you embroiled in debates about paradigms for AI, and 603 00:37:53,290 --> 00:37:57,030 whether there's been a paradigm shift for AI. 604 00:37:57,030 --> 00:37:59,984 What are your, can you share your thoughts on that? 605 00:37:59,984 --> 00:38:05,157 >> Yes, happily, so I think that in the early days, back in the 50s, 606 00:38:05,157 --> 00:38:10,335 people like von Neumann and Turing didn't believe in symbolic AI, 607 00:38:10,335 --> 00:38:14,220 they were far more inspired by the brain. 608 00:38:14,220 --> 00:38:20,127 Unfortunately, they both died much too young, and their voices weren't heard. 609 00:38:20,127 --> 00:38:21,806 And in the early days of AI, 610 00:38:21,806 --> 00:38:26,259 people were completely convinced that the representations you need for 611 00:38:26,259 --> 00:38:30,500 intelligence were symbolic expressions of some kind.
612 00:38:30,500 --> 00:38:35,509 Sort of cleaned-up logic, where you could do non-monotonic things, and not quite 613 00:38:35,509 --> 00:38:41,143 logic, but something like logic, and that the essence of intelligence was reasoning. 614 00:38:41,143 --> 00:38:45,662 What's happened now is, there's a completely different view, 615 00:38:45,662 --> 00:38:50,984 which is that what a thought is, is just a great big vector of neural activity, 616 00:38:50,984 --> 00:38:55,200 so contrast that with a thought being a symbolic expression. 617 00:38:55,200 --> 00:38:59,087 And I think the people who thought that thoughts were symbolic expressions just 618 00:38:59,087 --> 00:39:00,140 made a huge mistake. 619 00:39:01,210 --> 00:39:07,030 What comes in is a string of words, and what comes out is a string of words. 620 00:39:08,140 --> 00:39:12,580 And because of that, strings of words are the obvious way to represent things. 621 00:39:12,580 --> 00:39:15,710 So they thought what must be in between was a string of words, or 622 00:39:15,710 --> 00:39:18,360 something like a string of words. 623 00:39:18,360 --> 00:39:21,310 And I think what's in between is nothing like a string of words. 624 00:39:21,310 --> 00:39:26,060 I think the idea that thoughts must be in some kind of language is as silly as 625 00:39:26,060 --> 00:39:30,980 the idea that understanding the layout of a spatial scene 626 00:39:30,980 --> 00:39:34,280 must be in pixels, when pixels come in. 627 00:39:34,280 --> 00:39:37,930 And if we had a dot matrix printer attached to us, 628 00:39:37,930 --> 00:39:41,929 then pixels would come out, but what's in between isn't pixels. 629 00:39:43,210 --> 00:39:46,620 And so I think thoughts are just these great big vectors, and 630 00:39:46,620 --> 00:39:48,460 that big vectors have causal powers. 631 00:39:48,460 --> 00:39:50,490 They cause other big vectors, and 632 00:39:50,490 --> 00:39:56,100 that's utterly unlike the standard AI view that thoughts are symbolic expressions. 633 00:39:56,100 --> 00:39:56,700 >> I see, good, 634 00:39:57,740 --> 00:40:01,560 I guess AI is certainly coming round to this new point of view these days. 635 00:40:01,560 --> 00:40:02,660 >> Some of it, 636 00:40:02,660 --> 00:40:08,230 I think a lot of people in AI still think thoughts have to be symbolic expressions. 637 00:40:08,230 --> 00:40:09,780 >> Thank you very much for doing this interview. 638 00:40:09,780 --> 00:40:12,970 It was fascinating to hear how deep learning has evolved over the years, 639 00:40:12,970 --> 00:40:17,680 as well as how you're still helping drive it into the future, so thank you, Geoff. 640 00:40:17,680 --> 00:40:19,038 >> Well, thank you for giving me this opportunity. 641 00:40:19,038 --> 00:40:20,147 >> Thank you.