Hi Yann, you've been such a leader in deep learning for so long, thanks a lot for doing this with us.
>> Well, thanks for having me.
>> So, you've been working on neural nets for a long time. I would love to hear your personal story: how did you get started in AI, and how did you end up working with neural networks?
>> So, I was always interested in intelligence in general, in the origins of intelligence in humans. That got me interested in human evolution when I was a kid.
>> That was in France?
>> It was in France, yeah. I was in middle school or something, and I was interested in technology, space, etc. My favorite movie was 2001: A Space Odyssey. Intelligent machines, space travel, and human evolution were the themes I was fascinated by, and the concept of intelligent machines really appealed to me. And then I studied electrical engineering. When I was maybe in my second year of engineering school, I stumbled on a book, which was actually a philosophy book. It was a debate between Noam Chomsky, the computational linguist at MIT, and Jean Piaget, the cognitive psychologist of child development in Switzerland. It was basically a debate between nature and nurture, with Chomsky arguing that language has a lot of innate structure, and Piaget saying that a lot of it is learned, etc. Each of them had brought a team of people to argue for their side, and on the side of Piaget was Seymour Papert from MIT, who had worked on the perceptron model, one of the first machines capable of learning. I had never heard of the perceptron, and I read this article that said "a machine capable of learning," and that sounded wonderful. So I started going to several university libraries and searching for everything I could find that talked about the perceptron, and I realized there were a lot of papers from the 50s, but it kind of stopped at the end of the 60s, with a book co-authored by that same Seymour Papert.
>> What year was this?
>> This was 1980, roughly.
>> Right.
>> And so I did a couple of projects with some of the math professors at my school on neural nets, essentially. But there was no one there I could talk to who had worked on this, because the field had basically disappeared in the meantime, right? In 1980, nobody was working on this. I experimented with it a little bit, writing simulation software of various kinds, and reading about neuroscience. When I finished my engineering studies, I studied chip design, VLSI design, at the time, so something completely different. And when I finished, I really wanted to do research on this, and I had already figured out that the important question at the time was how you train neural nets with multiple layers. It was pretty clear in the literature of the 60s that that was the important question that had been left unsolved, the idea of hierarchy and everything. I had read Fukushima's article on the neocognitron, right? Which was this sort of hierarchical architecture, very similar to what we now call convolutional nets, but without really a backprop-style learning algorithm. And I met people who were in a small independent lab in France. They were interested in what they called, at the time, automata networks. And they gave me a couple of papers, the first ones on Hopfield networks, which are not very popular anymore, but which were the first associative memories with neural nets. Those papers revived the interest of some research communities in neural nets in the early 80s, mostly among physicists, condensed-matter physicists, and a few psychologists. It was still not okay for engineers and computer scientists to talk about neural nets. And they also showed me another paper that had just been distributed as a preprint, whose title was "Optimal Perceptual Inference." This was the first paper on Boltzmann machines, by Geoff Hinton and Terry Sejnowski. It was talking about hidden units; it was talking about, basically, learning multilayer neural nets that were more capable than just linear classifiers. So I said, I need to meet these people [LAUGH].
>> Wow.
>> Because they were the only people interested in the right problem.
>> And a couple of years later, after I started my PhD, I participated in a workshop in Les Houches that was organized by the people I was working with. And Terry was one of the speakers at the workshop, so I met him at that time.
>> This was the early 80s now?
>> This was 1985, early 1985. So I met Terry Sejnowski in early 1985 at the workshop in Les Houches, and a lot of people were there: founders of early neural nets, John Hopfield, and a lot of people working on theoretical neuroscience and things like that. It was a fascinating workshop. I also met a couple of people from Bell Labs, who eventually hired me at Bell Labs, but that was several years before I finished my PhD. So I talked to Terry Sejnowski, and I was telling him about what I was working on, which was some version of backprop at the time. This was before backprop was a paper, and Terry was working on NetTalk at the time. This was before the Rumelhart, Hinton, Williams paper on backprop had been published. But he was friends with Geoff, the information was circulating, so he was already working on trying to make this work for NetTalk, but he didn't tell me.
>> I see.
>> And he went back to the US and told Geoff, there's some kid in France who's working on the same stuff we're working on.
>> I see.
>> [LAUGH] And then a few months later, in June, there was another conference in France where Geoff was a keynote speaker. He gave a talk on Boltzmann machines; of course, he was working on the backprop paper at the time. And after his talk, there were 50 people around him who wanted to talk to him, and the first thing he said to the organizer was, do you know this guy Yann LeCun? It was because he had read my paper in the proceedings, which was written in French. He could sort of read French, and he could see the math, and he could figure out that it was sort of backprop. So we had lunch together, and that's how we became friends.
>> I see, well.
>> [LAUGH]
>> So that's because multiple groups independently reinvented, or invented, backprop, pretty much.
>> Right. Well, we realized that the whole thing comes down to the chain rule, or what the optimal control people call the adjoint state method; optimal control back in the early 60s is really the context in which backprop was first invented. The idea that you can use gradient descent, basically, through multiple stages is what backprop really is, and it popped up in various contexts at various times. But I think the Rumelhart, Hinton, Williams paper is the one that popularized it.
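In symbols, the idea is just the chain rule applied through a cascade of stages. The notation below is a generic restatement chosen for illustration, not taken from the interview: for a multi-stage system \(x_k = f_k(x_{k-1}, w_k)\), \(k = 1, \dots, K\), with cost \(C(x_K)\), the adjoint state \(\lambda_k = \partial C / \partial x_k\) is carried backward from the output,

\[
\lambda_K = \frac{\partial C}{\partial x_K}, \qquad
\lambda_{k-1} = \left(\frac{\partial f_k}{\partial x_{k-1}}\right)^{\!\top} \lambda_k, \qquad
\frac{\partial C}{\partial w_k} = \left(\frac{\partial f_k}{\partial w_k}\right)^{\!\top} \lambda_k,
\]

and gradient descent then updates each stage's parameters, \(w_k \leftarrow w_k - \eta\, \partial C / \partial w_k\).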
>> I see, yeah, cool. And then fast forward a few years, you wound up at AT&T Bell Labs, where you invented, among many things, LeNet, which we talk about in the course. And I remember, way back when I was a summer intern at AT&T Bell Labs working with Michael Kearns and a few others, hearing about your work even then. So tell me more about your AT&T and LeNet experience.
>> Okay, so what happened is, I actually started working on convolutional nets when I was a postdoc at the University of Toronto with Geoff Hinton. I wrote the code there, and I did the first experiments there, which showed that it worked if you had a very small data set. There was no MNIST or anything like that back then, so I made the data set I was training on myself: I drew a bunch of characters with my mouse. I had an Amiga, a personal computer, which was the best computer ever. I drew a bunch of characters, did augmentation to kind of increase the set, and then used that as a way to test performance. And I compared things like fully connected nets, locally connected nets without shared weights, and then shared-weight networks, which was basically the first ConvNet. And that worked really well for relatively small data sets; I could show that you got better performance and no over-training with the convolutional architecture. And when I got to Bell Labs in October 1988, the first thing I did was scale up the network, because we had faster computers. A few months before I went to Bell Labs, my boss at the time, Larry Jackel, who was the department head, said, we should order a computer for you before you come; what do you want? I said, well, here in Toronto there is a Sun 4, which was the hot machine at the time; it'd be great if we had one. And they ordered one, and I had one for myself. At the University of Toronto, it was one for the entire department, right? Here, it was one just for me. And Larry told me, you know, at Bell Labs you don't get famous by saving money.
>> [LAUGH]
>> So that was awesome. And they had already been working for a while on character recognition. They had this enormous data set called USPS that had 5,000 training samples. [LAUGH] And so immediately I trained a ConvNet, which was basically LeNet 1, on this data set, and got really good results, better results than the other methods we had tried on it, and that other people had tried on it. So we knew we had something fairly early on; this was within three months of me joining Bell Labs. And that first version of the convolutional net had convolution with stride, and we did not have separate convolution and pooling layers.
>> Mm-hm.
>> So each convolution was actually subsampling directly. And the reason for this is that we just could not afford to have a convolution at every location; there was just too much computation.
>> I see.
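A rough 1-D sketch of what "each convolution was subsampling directly" means. This is a hypothetical illustration, not the original LeNet code: the kernel is evaluated only every `stride` positions, so the subsampling is folded into the convolution itself and most locations are never computed at all.

```python
import numpy as np

def conv1d_strided(x, kernel, stride):
    """Valid 1-D convolution that only evaluates the kernel every `stride` steps."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = np.dot(x[i * stride : i * stride + k], kernel)
    return out

x = np.arange(12, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])

subsampled = conv1d_strided(x, kernel, stride=2)  # convolution + subsampling in one step
dense = conv1d_strided(x, kernel, stride=1)       # convolution at every location
assert np.allclose(subsampled, dense[::2])        # same values, a fraction of the work
```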
>> [COUGH] So the second version had a separate convolution, and a pooling and subsampling layer. I guess that's the one that's called LeNet 1, really. We published a couple of papers on this, in Neural Computation and at NIPS. And, interesting story: I gave a talk at NIPS about this work, and Geoff Hinton was in the audience. When I came back to my seat, I was sitting next to him, and he said, there's one bit of information in your talk, which is that if you do all the sensible things, it actually works.
>> [LAUGH] And that line of work went on to make history, because these ideas became widely adopted for reading cheques and-
>> Yeah, they were widely adopted within AT&T, but not very much outside, and I think it's a little difficult for me to really understand why, but a simple factor is [INAUDIBLE]. So this was back in the late 80s, and there was no Internet. We had email, we had FTP, but there was no Internet, really. No two labs were using the same software or hardware platform, right? Some people had Sun workstations, others had other machines, some people were using PCs or whatever. There was no such thing as Python or MATLAB or anything like that; people were writing their own code. I had spent a year and a half, together with Léon Bottou, who was still a student at the time, basically just writing a neural net simulator. And because there was no MATLAB or Python, you had to write your own interpreter to control it. So we wrote our own Lisp interpreter, and all the networks were written in Lisp, using a numerical backend, very similar to what we have now: blocks that you can interconnect, automatic differentiation, and all the stuff we're now familiar with from Torch, PyTorch, and TensorFlow.
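A minimal sketch of that block-based design, in hypothetical Python standing in for the Lisp simulator being described (the class names and shapes here are invented for illustration): each block knows how to run forward and how to push gradients backward, so once blocks are chained, differentiating the whole network comes for free.

```python
import numpy as np

class Linear:
    """A fully connected block: y = W x."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
    def forward(self, x):
        self.x = x                                 # remember input for the backward pass
        return self.W @ x
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # gradient w.r.t. the weights
        return self.W.T @ grad_out                 # gradient w.r.t. the input

class Tanh:
    """A pointwise nonlinearity block."""
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1.0 - self.y ** 2)

rng = np.random.default_rng(0)
net = [Linear(4, 3, rng), Tanh(), Linear(3, 1, rng)]

x = rng.standard_normal(4)
for block in net:            # forward: data flows through the interconnected blocks
    x = block.forward(x)
g = np.ones_like(x)
for block in reversed(net):  # backward: gradients flow the other way, block by block
    g = block.backward(g)
```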
>> So then we developed a bunch of applications. We got together with a group of engineers, very smart people. Some of them were theoretical physicists who had kind of turned engineer at Bell Labs. Chris Burges was one of them, who then had a distinguished career at Microsoft Research afterwards, and Craig Nohl. And we collaborated with them to make this technology practical.
>> I see.
>> And so together we developed these character recognition systems. And that meant integrating the convolutional net with things similar to what we now call CRFs, for interpreting sequences of characters, not just individual characters.
>> Yeah, right, the LeNet paper was partially about the neural network and partially about the automata machinery to pull it all together?
>> Yeah, that's right. And so the first half of the paper is on convolutional nets, and the paper is mostly cited for that. And then the second half, which very few people have read, [LAUGH] is about sort of sequence-level discriminative learning, basically structured prediction without normalization. So it's very similar to a CRF, in fact.
>> Fascinating.
>> It predated CRFs by several years. So that was very successful, except for what happened the day we were celebrating the deployment of that system at a major bank. We had worked with the group I mentioned, which was doing the engineering of the whole system, and then with a product group in a different part of the country that belonged to a subsidiary of AT&T called NCR-
>> [CROSSTALK]
>> National Cash Register, right. They build ATM machines, and they build large check reading machines for banks. So they were the customers, if you want; they were using our check reading systems. And they had deployed it at a bank, I can't remember which bank it was, in ATM machines that could read the checks you deposited. And we were all at a fancy restaurant celebrating the deployment of this thing, when the company announced that it was breaking itself up. So this was 1995, and AT&T announced that it was splitting into AT&T, Lucent Technologies, and NCR. So NCR was spun off, and Lucent Technologies was spun off. The engineering group went with Lucent Technologies, and the product group, of course, went with NCR. And the sad thing is that the AT&T lawyers, in their infinite wisdom, assigned the patents; there was a patent on convolutional nets, which has thankfully expired-
>> I see [LAUGH].
>> [LAUGH] It expired in 2007, about ten years ago. And they assigned the patent to NCR, but there was nobody at NCR who actually knew even what a convolutional net was, really. So the patent was in the hands of people who had no idea what they had. And we were in a different company that now could not really develop the technology, and our engineering team was in a separate company, because we went with AT&T, engineering went with Lucent, and the product group went with NCR. So it was a little depressing [LAUGH].
>> So in addition to your early work, you kept persisting on neural networks even when there was a sort of winter for neural nets. What was that like?
>> Well, so I persisted and didn't persist, in some ways. I was always convinced that eventually those techniques would come back to the fore, that people would figure out how to use them in practice, and that they would be useful. So I always had that in the back of my mind. But in 1996, when AT&T broke itself up and all of our work on character recognition was basically broken up too, because parts of the group went their separate ways, I was also promoted to department head, and I had to figure out what to work on. This was the early days of the Internet, we're talking 1995, and I had the idea, somehow, that one big problem with the emergence of the Internet was going to be bringing all the knowledge that we had on paper into the digital world. And so I started a project called DjVu, D-J-V-U, which was to compress scanned documents, essentially, so they could be distributed over the Internet. That project was really fun for a while and had some success, although AT&T really didn't know what to do with it.
>> Yeah, I remember that really helping the dissemination of online research papers.
>> Yeah, that's right, exactly. And we scanned the entire proceedings of NIPS and made them available online-
>> I see, I remember that.
>> To kind of demonstrate how it worked. And we could compress high-resolution pages to just a few kilobytes.
>> So ConvNets, starting from some of your much earlier work, have now come and pretty much taken over the field of computer vision, and they're starting to encroach significantly into even other fields. So just tell me about how you saw that whole process.
>> [LAUGH] So, to tell you how I thought this was going to happen early on: first of all, I always believed that this was going to work. It required fast computers and lots of data, but I always believed, somehow, that this was going to be the right thing to do. What I thought originally, when I was at Bell Labs, was that there was going to be some sort of continuous progress along these directions as machines got more powerful. We were even designing chips to run convolutional nets at Bell Labs; Bernhard Boser and Hans Peter Graf, separately, had two different chips for running convolutional nets really efficiently.
>> And so we thought there was going to be a kind of pickup of this, a growing interest, and sort of continuous progress. But in fact, because interest in neural nets sort of died in the mid-90s, that didn't happen. So there was kind of a dark period of six or seven years, between roughly 1995 and 2002, when basically nobody was working on this. In fact, there was a little bit of work. There was some work at Microsoft in the early 2000s on using convolutional nets for Chinese character recognition.
>> [INAUDIBLE]
>> Patrice Simard's group, yeah, exactly. And there was some other small work for face detection and things like this, in France and in various other places, but it was very small. I discovered recently that a couple of groups came up with ideas that were essentially very similar to convolutional nets, but never quite published them the same way, for medical image analysis. And those were mostly in the context of commercial systems, so they never quite made it out to the public. I mean, it was after our first work on convolutional nets, and they were not really aware of it, but it sort of developed in parallel a little bit. So several people got similar ideas, at several years' interval. But then I was really surprised by how fast interest picked up after the ImageNet-
>> 2012.
>> In 2012, yes, at the end of 2012. There was a very interesting event at ECCV in Florence, where there was a workshop on ImageNet. And everybody already knew that Alex Krizhevsky had won by a large margin, so everybody was waiting for his talk. And most people in the computer vision community had no idea what a convolutional net was. I mean, they had heard me talk about it; I actually had an invited talk at CVPR in 2000 where I talked about it, but most people had not paid much attention to it. Senior people did, they knew what it was, but the more junior people in the community really had no idea what it was. And so Alex just gives his talk, and he doesn't explain what a convolutional net is, because he assumes everybody knows, right? Because he comes from the machine learning community. So he says, here's how everything is connected, and how we transform the data, and what results we get, again assuming that everybody knows what it is.
>> And a lot of people were incredibly surprised. And you could see the opinion of people changing as he was giving his talk, very senior people in the field.
>> So you think that workshop was a defining moment that swayed a lot of the computer vision community.
>> Yeah, definitely.
>> That's right, yeah.
>> That's the way it happened, yeah, right there.
>> So today, you retain a faculty position at NYU, and you also lead FAIR, Facebook AI Research. I know you have a pretty unique point of view on how corporate research should be done. Do you want to share your thoughts on that?
>> Yeah, so I mean, one of the beautiful things that I managed to do at Facebook in the last four years is that I was given a lot of freedom to set up FAIR the way I thought was most appropriate, because this was the first research organization within Facebook. Facebook is a sort of engineering-centric company, and so far it had really been focused on survival, on short-term things. But Facebook was about to turn ten years old, had had a successful IPO, and was basically thinking about the next ten years, right? I mean, Mark Zuckerberg was thinking, what is going to be important for the next ten years? And the survival of the company was not in question anymore. So this was the kind of transition where a large company can start to think further ahead. It was not such a large company at the time, Facebook had 5,000 employees or so, but it had the luxury to think about the next ten years and what would be important in technology. And Mark and his team decided that AI was going to be a crucial piece of technology for connecting people, which is the mission of Facebook. And so they explored several ways to build an effort in AI. They had a small internal engineering group experimenting with convolutional nets that was getting really good results on face recognition and various other things, which piqued their interest. And they explored the idea of hiring a bunch of young researchers, or acquiring a company, or things like this. And they settled on the idea of hiring someone senior in the field and setting up a research organization.
>> And it was a bit of a culture shock initially, because the way research operates in a company is very different from engineering, right? You have longer time scales and horizons, and researchers tend to be very conservative about the choice of places where they want to work. And I made it very clear very early on that research needed to be open: that researchers needed to not only be encouraged to publish, but even be mandated to publish, and also to be evaluated on criteria similar to those we use to evaluate academic researchers. [COUGH] And what Mark and Mike Schroepfer, the CTO of the company, who is my boss now, said was, Facebook is a very open company; we distribute a lot of stuff in open source. Schroepfer, the CTO, comes from the open source world-
>> Mozilla.
>> He was at Mozilla before that, and a lot of people came from that world. So that was in the DNA of the company, and that made me very confident that we could set up an open research organization. And then the fact that the company is not obsessive-compulsive about IP, as some other companies are, makes it much easier to collaborate with universities, and to have arrangements by which a person can have a foot in industry and a foot in academia.
>> And you find that valuable, yourself?
>> Absolutely, yes. If you look at my publications over the last four years, the vast majority of them are publications with my students at NYU.
>> I see.
>> Because at Facebook, I did a lot of organizing the lab, hiring, setting the direction, advising, and things like this. But I don't get involved in individual research projects to get my name on papers, and I don't care to get my name on papers anymore.
>> So it's sending someone else to do your dirty work rather than doing all the dirty work yourself.
>> Exactly. And you want to stay behind the scenes; you don't want to put yourself in competition with the people in your lab, in that case.
>> I'm sure you get asked this a lot, but I'm hoping you'll answer for all the people watching this video as well.
>> What advice do you have for someone wanting to break into AI?
>> [LAUGH] I mean, it's such a different world now than it was when I got started. But I think what's great now is that it's very easy for people to get involved at some level. The tools that are available are so easy to use now. You can run things on a cheap computer in your bedroom [LAUGH] and basically train your convolutional net or your recurrent net to do whatever, and there are a lot of tools. You can learn a lot about this from online material; it's not very onerous. So you see high school students now playing with this, right? Which is kind of great, I think. And there's certainly a growing interest from the student population in learning about machine learning and AI, and it's very exciting for young people. I find that wonderful. So my advice is, if you want to get into this, make yourself useful. Make a contribution to an open source project, for example. Or make an implementation of some standard algorithm that you can't find the code for online, but that you'd like to make available to other people. So take a paper that you think is important, re-implement the algorithm, and then put it in an open source package, or contribute to one of those open source packages. And if the stuff you write is interesting and useful, you'll get noticed. Maybe you'll get a nice job at a company you really wanted a job at, or maybe you'll get accepted into your favorite PhD program, or things like this. So I think that's a good way to get started.
>> So open source contributions are a good way to enter the community, to give back and to learn.
>> Yeah, that's right, that's right.
>> Thanks a lot, Yann, that was fascinating. I've known you for many years, and it's still fascinating to hear the details of all the stories that have gone on over the years.
>> Yeah, there are many stories like this where, reflecting back at the moment when they happened, you don't realize what importance they might take on 10 or 20 years later.
>> Yeah, thank you.
>> Thanks.