Hi Yann, you've been such a leader in deep learning for so long, thanks a lot for doing this with us.
>> Well, thanks for having me.
>> So, you've been working on neural nets for a long time. I would love to hear your personal story: how did you get started in AI, and how did you end up working with neural networks?
>> So, I was always interested in intelligence in general, in the origins of intelligence in humans. That got me interested in human evolution when I was a kid.
>> That was in France?
>> It was in France, yeah. I was in middle school or something, and I was interested in technology, space, etc. My favorite movie was 2001: A Space Odyssey. Intelligent machines, space travel, and human evolution were the themes I was fascinated by, and the concept of intelligent machines really appealed to me. And then I studied electrical engineering. When I was maybe in my second year of engineering school, I stumbled on a book, which was actually a philosophy book. It was a debate between Noam Chomsky, the computational linguist at MIT, and Jean Piaget, the cognitive psychologist of child development in Switzerland. It was basically a debate between nature and nurture, with Chomsky arguing that language has a lot of innate structure, and Piaget saying that a lot of it is learned, etc. Each of them had brought a team of people to argue for their side, and on the side of Piaget was Seymour Papert from MIT, who had worked on the perceptron model, one of the first machines capable of learning. I had never heard of the perceptron, and I read this article that said "a machine capable of learning," and that sounded wonderful. So I started going to several university libraries and searching for everything I could find that talked about the perceptron, and I realized there were a lot of papers from the 50s, but it kind of stopped at the end of the 60s, with a book co-authored by that same Seymour Papert.
>> What year was this?
>> This was 1980, roughly.
>> Right.
>> And so I did a couple of projects with some of the math professors at my school on neural nets, essentially. But there was no one there I could talk to who had worked on this, because the field had basically disappeared in the meantime, right? In 1980, nobody was working on this. I experimented with it a little bit, writing simulation software of various kinds, and reading about neuroscience. When I finished my engineering studies, I studied chip design, VLSI design, at the time, so something completely different. And when I finished, I really wanted to do research on this, and I had already figured out that the important question at the time was how you train neural nets with multiple layers. It was pretty clear in the literature of the 60s that that was the important question that had been left unsolved, the idea of hierarchy and everything. I had read Fukushima's article on the neocognitron, right? Which was this sort of hierarchical architecture, very similar to what we now call convolutional nets, but without really a backprop-style learning algorithm. And I met people who were in a small independent lab in France. They were interested in what they called, at the time, automata networks. And they gave me a couple of papers, the first ones on Hopfield networks, which are not very popular anymore, but which were the first associative memories with neural nets. Those papers revived the interest of some research communities in neural nets in the early 80s, mostly among physicists, condensed-matter physicists, and a few psychologists. It was still not okay for engineers and computer scientists to talk about neural nets. And they also showed me another paper that had just been distributed as a preprint, whose title was "Optimal Perceptual Inference." This was the first paper on Boltzmann machines, by Geoff Hinton and Terry Sejnowski. It was talking about hidden units; it was talking about, basically, learning multilayer neural nets that were more capable than just linear classifiers. So I said, I need to meet these people [LAUGH].
>> Wow.
>> Because they were the only people interested in the right problem.
>> And a couple of years later, after I started my PhD, I participated in a workshop in Les Houches that was organized by the people I was working with. And Terry was one of the speakers at the workshop, so I met him at that time.
>> This was the early 80s now?
>> This was 1985, early 1985. So I met Terry Sejnowski in early 1985 at the workshop in Les Houches, and a lot of people were there: founders of early neural nets, John Hopfield, and a lot of people working on theoretical neuroscience and things like that. It was a fascinating workshop. I also met a couple of people from Bell Labs, who eventually hired me at Bell Labs, but that was several years before I finished my PhD. So I talked to Terry Sejnowski, and I was telling him about what I was working on, which was some version of backprop at the time. This was before backprop was a paper, and Terry was working on NetTalk at the time. This was before the Rumelhart, Hinton, Williams paper on backprop had been published. But he was friends with Geoff, the information was circulating, so he was already working on trying to make this work for NetTalk, but he didn't tell me.
>> I see.
>> And he went back to the US and told Geoff, there's some kid in France who's working on the same stuff we're working on.
>> I see.
>> [LAUGH] And then a few months later, in June, there was another conference in France where Geoff was a keynote speaker. He gave a talk on Boltzmann machines; of course, he was working on the backprop paper at the time. And after his talk, there were 50 people around him who wanted to talk to him, and the first thing he said to the organizer was, do you know this guy Yann LeCun? It was because he had read my paper in the proceedings, which was written in French. He could sort of read French, and he could see the math, and he could figure out that it was sort of backprop. So we had lunch together, and that's how we became friends.
>> I see, well.
>> [LAUGH]
>> So that's because multiple groups independently reinvented, or invented, backprop, pretty much.
>> Right. Well, we realized that the whole thing comes down to the chain rule, or what the optimal control people call the adjoint state method; optimal control back in the early 60s is really the context in which backprop was first invented. The idea that you can use gradient descent, basically, through multiple stages is what backprop really is, and it popped up in various contexts at various times. But I think the Rumelhart, Hinton, Williams paper is the one that popularized it.
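In symbols, the idea is just the chain rule applied through a cascade of stages. The notation below is a generic restatement chosen for illustration, not taken from the interview: for a multi-stage system \(x_k = f_k(x_{k-1}, w_k)\), \(k = 1, \dots, K\), with cost \(C(x_K)\), the adjoint state \(\lambda_k = \partial C / \partial x_k\) is carried backward from the output,

\[
\lambda_K = \frac{\partial C}{\partial x_K}, \qquad
\lambda_{k-1} = \left(\frac{\partial f_k}{\partial x_{k-1}}\right)^{\!\top} \lambda_k, \qquad
\frac{\partial C}{\partial w_k} = \left(\frac{\partial f_k}{\partial w_k}\right)^{\!\top} \lambda_k,
\]

and gradient descent then updates each stage's parameters, \(w_k \leftarrow w_k - \eta\, \partial C / \partial w_k\).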
>> I see, yeah, cool. And then fast forward a few years, you wound up at AT&T Bell Labs, where you invented, among many things, LeNet, which we talk about in the course. And I remember, way back when I was a summer intern at AT&T Bell Labs working with Michael Kearns and a few others, hearing about your work even then. So tell me more about your AT&T and LeNet experience.
>> Okay, so what happened is, I actually started working on convolutional nets when I was a postdoc at the University of Toronto with Geoff Hinton. I wrote the code there, and I did the first experiments there, which showed that it worked if you had a very small data set. There was no MNIST or anything like that back then, so I made the data set I was training on myself: I drew a bunch of characters with my mouse. I had an Amiga, a personal computer, which was the best computer ever. I drew a bunch of characters, did augmentation to kind of increase the set, and then used that as a way to test performance. And I compared things like fully connected nets, locally connected nets without shared weights, and then shared-weight networks, which was basically the first ConvNet. And that worked really well for relatively small data sets; I could show that you got better performance and no over-training with the convolutional architecture. And when I got to Bell Labs in October 1988, the first thing I did was scale up the network, because we had faster computers. A few months before I went to Bell Labs, my boss at the time, Larry Jackel, who was the department head, said, we should order a computer for you before you come; what do you want? I said, well, here in Toronto there is a Sun 4, which was the hot machine at the time; it'd be great if we had one. And they ordered one, and I had one for myself. At the University of Toronto, it was one for the entire department, right? Here, it was one just for me. And Larry told me, you know, at Bell Labs you don't get famous by saving money.
>> [LAUGH]
>> So that was awesome. And they had already been working for a while on character recognition. They had this enormous data set called USPS that had 5,000 training samples. [LAUGH] And so immediately I trained a ConvNet, which was basically LeNet 1, on this data set, and got really good results, better results than the other methods we had tried on it, and that other people had tried on it. So we knew we had something fairly early on; this was within three months of me joining Bell Labs. And that first version of the convolutional net had convolution with stride, and we did not have separate convolution and pooling layers.
>> Mm-hm.
>> So each convolution was actually subsampling directly. And the reason for this is that we just could not afford to have a convolution at every location; there was just too much computation.
>> I see.
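A rough 1-D sketch of what "each convolution was subsampling directly" means. This is a hypothetical illustration, not the original LeNet code: the kernel is evaluated only every `stride` positions, so the subsampling is folded into the convolution itself and most locations are never computed at all.

```python
import numpy as np

def conv1d_strided(x, kernel, stride):
    """Valid 1-D convolution that only evaluates the kernel every `stride` steps."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = np.dot(x[i * stride : i * stride + k], kernel)
    return out

x = np.arange(12, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])

subsampled = conv1d_strided(x, kernel, stride=2)  # convolution + subsampling in one step
dense = conv1d_strided(x, kernel, stride=1)       # convolution at every location
assert np.allclose(subsampled, dense[::2])        # same values, a fraction of the work
```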
>> [COUGH] So the second version had a separate convolution, and a pooling and subsampling layer. I guess that's the one that's called LeNet 1, really. We published a couple of papers on this, in Neural Computation and at NIPS. And, interesting story: I gave a talk at NIPS about this work, and Geoff Hinton was in the audience. When I came back to my seat, I was sitting next to him, and he said, there's one bit of information in your talk, which is that if you do all the sensible things, it actually works.
>> [LAUGH] And that line of work went on to make history, because these ideas became widely adopted for reading cheques and-
>> Yeah, they were widely adopted within AT&T, but not very much outside, and I think it's a little difficult for me to really understand why, but a simple factor is [INAUDIBLE]. So this was back in the late 80s, and there was no Internet. We had email, we had FTP, but there was no Internet, really. No two labs were using the same software or hardware platform, right? Some people had Sun workstations, others had other machines, some people were using PCs or whatever. There was no such thing as Python or MATLAB or anything like that; people were writing their own code. I had spent a year and a half, together with Léon Bottou, who was still a student at the time, basically just writing a neural net simulator. And because there was no MATLAB or Python, you had to write your own interpreter to control it. So we wrote our own Lisp interpreter, and all the networks were written in Lisp, using a numerical backend, very similar to what we have now: blocks that you can interconnect, automatic differentiation, and all the stuff we're now familiar with from Torch, PyTorch, and TensorFlow.
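A minimal sketch of that block-based design, in hypothetical Python standing in for the Lisp simulator being described (the class names and shapes here are invented for illustration): each block knows how to run forward and how to push gradients backward, so once blocks are chained, differentiating the whole network comes for free.

```python
import numpy as np

class Linear:
    """A fully connected block: y = W x."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
    def forward(self, x):
        self.x = x                                 # remember input for the backward pass
        return self.W @ x
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # gradient w.r.t. the weights
        return self.W.T @ grad_out                 # gradient w.r.t. the input

class Tanh:
    """A pointwise nonlinearity block."""
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1.0 - self.y ** 2)

rng = np.random.default_rng(0)
net = [Linear(4, 3, rng), Tanh(), Linear(3, 1, rng)]

x = rng.standard_normal(4)
for block in net:            # forward: data flows through the interconnected blocks
    x = block.forward(x)
g = np.ones_like(x)
for block in reversed(net):  # backward: gradients flow the other way, block by block
    g = block.backward(g)
```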
>> So then we developed a bunch of applications. We got together with a group of engineers, very smart people. Some of them were theoretical physicists who had kind of turned engineer at Bell Labs. Chris Burges was one of them, who then had a distinguished career at Microsoft Research afterwards, and Craig Nohl. And we collaborated with them to make this technology practical.
>> I see.
>> And so together we developed these character recognition systems. And that meant integrating the convolutional net with things similar to what we now call CRFs, for interpreting sequences of characters, not just individual characters.
>> Yeah, right, the LeNet paper was partially about the neural network and partially about the automata machinery to pull it all together?
>> Yeah, that's right. And so the first half of the paper is on convolutional nets, and the paper is mostly cited for that. And then the second half, which very few people have read, [LAUGH] is about sort of sequence-level discriminative learning, basically structured prediction without normalization. So it's very similar to a CRF, in fact.
>> Fascinating.
>> It predated CRFs by several years. So that was very successful, except for what happened the day we were celebrating the deployment of that system at a major bank. We had worked with the group I mentioned, which was doing the engineering of the whole system, and then with a product group in a different part of the country that belonged to a subsidiary of AT&T called NCR-
>> [CROSSTALK]
>> National Cash Register, right. They build ATM machines, and they build large check reading machines for banks. So they were the customers, if you want; they were using our check reading systems. And they had deployed it at a bank, I can't remember which bank it was, in ATM machines that could read the checks you deposited. And we were all at a fancy restaurant celebrating the deployment of this thing, when the company announced that it was breaking itself up. So this was 1995, and AT&T announced that it was splitting into AT&T, Lucent Technologies, and NCR. So NCR was spun off, and Lucent Technologies was spun off. The engineering group went with Lucent Technologies, and the product group, of course, went with NCR. And the sad thing is that the AT&T lawyers, in their infinite wisdom, assigned the patents; there was a patent on convolutional nets, which has thankfully expired-
>> I see [LAUGH].
>> [LAUGH] It expired in 2007, about ten years ago. And they assigned the patent to NCR, but there was nobody at NCR who actually knew even what a convolutional net was, really. So the patent was in the hands of people who had no idea what they had. And we were in a different company that now could not really develop the technology, and our engineering team was in a separate company, because we went with AT&T, engineering went with Lucent, and the product group went with NCR. So it was a little depressing [LAUGH].
>> So in addition to your early work, you kept persisting on neural networks even when there was a sort of winter for neural nets. What was that like?
>> Well, so I persisted and didn't persist, in some ways. I was always convinced that eventually those techniques would come back to the fore, that people would figure out how to use them in practice, and that they would be useful. So I always had that in the back of my mind. But in 1996, when AT&T broke itself up and all of our work on character recognition was basically broken up too, because parts of the group went their separate ways, I was also promoted to department head, and I had to figure out what to work on. This was the early days of the Internet, we're talking 1995, and I had the idea, somehow, that one big problem with the emergence of the Internet was going to be bringing all the knowledge that we had on paper into the digital world. And so I started a project called DjVu, D-J-V-U, which was to compress scanned documents, essentially, so they could be distributed over the Internet. That project was really fun for a while and had some success, although AT&T really didn't know what to do with it.
>> Yeah, I remember that really helping the dissemination of online research papers.
>> Yeah, that's right, exactly. And we scanned the entire proceedings of NIPS and made them available online-
>> I see, I remember that.
>> To kind of demonstrate how it worked. And we could compress high-resolution pages to just a few kilobytes.
>> So ConvNets, starting from some of your much earlier work, have now come and pretty much taken over the field of computer vision, and they're starting to encroach significantly into even other fields. So just tell me about how you saw that whole process.
>> [LAUGH] So, to tell you how I thought this was going to happen early on: first of all, I always believed that this was going to work. It required fast computers and lots of data, but I always believed, somehow, that this was going to be the right thing to do. What I thought originally, when I was at Bell Labs, was that there was going to be some sort of continuous progress along these directions as machines got more powerful. We were even designing chips to run convolutional nets at Bell Labs; Bernhard Boser and Hans Peter Graf, separately, had two different chips for running convolutional nets really efficiently.
>> And so we thought there was going to be a kind of pickup of this, a growing interest, and sort of continuous progress. But in fact, because interest in neural nets sort of died in the mid-90s, that didn't happen. So there was kind of a dark period of six or seven years, between roughly 1995 and 2002, when basically nobody was working on this. In fact, there was a little bit of work. There was some work at Microsoft in the early 2000s on using convolutional nets for Chinese character recognition.
>> [INAUDIBLE]
>> Patrice Simard's group, yeah, exactly. And there was some other small work for face detection and things like this, in France and in various other places, but it was very small. I discovered recently that a couple of groups came up with ideas that were essentially very similar to convolutional nets, but never quite published them the same way, for medical image analysis. And those were mostly in the context of commercial systems, so they never quite made it out to the public. I mean, it was after our first work on convolutional nets, and they were not really aware of it, but it sort of developed in parallel a little bit. So several people got similar ideas, at several years' interval. But then I was really surprised by how fast interest picked up after the ImageNet-
>> 2012.
>> In 2012, yes, at the end of 2012. There was a very interesting event at ECCV in Florence, where there was a workshop on ImageNet. And everybody already knew that Alex Krizhevsky had won by a large margin, so everybody was waiting for his talk. And most people in the computer vision community had no idea what a convolutional net was. I mean, they had heard me talk about it; I actually had an invited talk at CVPR in 2000 where I talked about it, but most people had not paid much attention to it. Senior people did, they knew what it was, but the more junior people in the community really had no idea what it was. And so Alex just gives his talk, and he doesn't explain what a convolutional net is, because he assumes everybody knows, right? Because he comes from the machine learning community. So he says, here's how everything is connected, and how we transform the data, and what results we get, again assuming that everybody knows what it is.
>> And a lot of people were incredibly surprised. And you could see the opinion of people changing as he was giving his talk, very senior people in the field.
>> So you think that workshop was a defining moment that swayed a lot of the computer vision community.
>> Yeah, definitely.
>> That's right, yeah.
>> That's the way it happened, yeah, right there.
>> So today, you retain a faculty position at NYU, and you also lead FAIR, Facebook AI Research. I know you have a pretty unique point of view on how corporate research should be done. Do you want to share your thoughts on that?
>> Yeah, so I mean, one of the beautiful things that I managed to do at Facebook in the last four years is that I was given a lot of freedom to set up FAIR the way I thought was most appropriate, because this was the first research organization within Facebook. Facebook is a sort of engineering-centric company, and so far it had really been focused on survival, on short-term things. But Facebook was about to turn ten years old, had had a successful IPO, and was basically thinking about the next ten years, right? I mean, Mark Zuckerberg was thinking, what is going to be important for the next ten years? And the survival of the company was not in question anymore. So this was the kind of transition where a large company can start to think further ahead. It was not such a large company at the time, Facebook had 5,000 employees or so, but it had the luxury to think about the next ten years and what would be important in technology. And Mark and his team decided that AI was going to be a crucial piece of technology for connecting people, which is the mission of Facebook. And so they explored several ways to build an effort in AI. They had a small internal engineering group experimenting with convolutional nets that was getting really good results on face recognition and various other things, which piqued their interest. And they explored the idea of hiring a bunch of young researchers, or acquiring a company, or things like this. And they settled on the idea of hiring someone senior in the field and setting up a research organization.
>> And it was a bit of a culture shock initially, because the way research operates in a company is very different from engineering, right? You have longer time scales and horizons, and researchers tend to be very conservative about the choice of places where they want to work. And I made it very clear very early on that research needed to be open: that researchers needed to not only be encouraged to publish, but even be mandated to publish, and also to be evaluated on criteria similar to those we use to evaluate academic researchers. [COUGH] And what Mark and Mike Schroepfer, the CTO of the company, who is my boss now, said was, Facebook is a very open company; we distribute a lot of stuff in open source. Schroepfer, the CTO, comes from the open source world-
>> Mozilla.
>> He was at Mozilla before that, and a lot of people came from that world. So that was in the DNA of the company, and that made me very confident that we could set up an open research organization. And then the fact that the company is not obsessive-compulsive about IP, as some other companies are, makes it much easier to collaborate with universities, and to have arrangements by which a person can have a foot in industry and a foot in academia.
>> And you find that valuable, yourself?
>> Absolutely, yes. If you look at my publications over the last four years, the vast majority of them are publications with my students at NYU.
>> I see.
>> Because at Facebook, I did a lot of organizing the lab, hiring, setting the direction, advising, and things like this. But I don't get involved in individual research projects to get my name on papers, and I don't care to get my name on papers anymore.
>> So it's sending someone else to do your dirty work rather than doing all the dirty work yourself.
>> Exactly. And you want to stay behind the scenes; you don't want to put yourself in competition with the people in your lab, in that case.
>> I'm sure you get asked this a lot, but I'm hoping you'll answer for all the people watching this video as well.
>> What advice do you have for someone wanting to break into AI?
>> [LAUGH] I mean, it's such a different world now than it was when I got started. But I think what's great now is that it's very easy for people to get involved at some level. The tools that are available are so easy to use now. You can run things on a cheap computer in your bedroom [LAUGH] and basically train your convolutional net or your recurrent net to do whatever, and there are a lot of tools. You can learn a lot about this from online material; it's not very onerous. So you see high school students now playing with this, right? Which is kind of great, I think. And there's certainly a growing interest from the student population in learning about machine learning and AI, and it's very exciting for young people. I find that wonderful. So my advice is, if you want to get into this, make yourself useful. Make a contribution to an open source project, for example. Or make an implementation of some standard algorithm that you can't find the code for online, but that you'd like to make available to other people. So take a paper that you think is important, re-implement the algorithm, and then put it in an open source package, or contribute to one of those open source packages. And if the stuff you write is interesting and useful, you'll get noticed. Maybe you'll get a nice job at a company you really wanted a job at, or maybe you'll get accepted into your favorite PhD program, or things like this. So I think that's a good way to get started.
>> So open source contributions are a good way to enter the community, to give back and to learn.
>> Yeah, that's right, that's right.
>> Thanks a lot, Yann, that was fascinating. I've known you for many years, and it's still fascinating to hear the details of all the stories that have gone on over the years.
>> Yeah, there are many stories like this where, reflecting back at the moment when they happened, you don't realize what importance they might take on 10 or 20 years later.
>> Yeah, thank you.
>> Thanks.