Welcome, Rus, I'm really glad you could join us here today.
>> Thank you, thank you Andrew.
>> So today you're the director of research at Apple, and you're also a professor at Carnegie Mellon University. So I'd love to hear a bit about your personal story. How did you end up doing this deep learning work that you do?
>> Yeah, to some extent I started in deep learning by luck. I did my master's degree at Toronto, and then I took a year off. I was actually working in the financial sector, which is a little bit surprising. At that time, I wasn't quite sure whether I wanted to go for my PhD or not. And then something surprising happened. I was going to work one morning, and I bumped into Geoff Hinton. And Geoff told me, hey, I have this terrific idea. Come to my office, I'll show you. So we basically walked together, and he started telling me about these Boltzmann Machines and contrastive divergence and some of the tricks, and at the time I didn't quite understand what he was talking about. But it really, really excited me. And then basically, within three months, I started my PhD with Geoff. So that was kind of the beginning, because that was back in 2005, 2006, and this is where some of the original deep learning algorithms, using Restricted Boltzmann Machines and unsupervised pre-training, were popping up. So that's how I started, really. That one particular morning when I bumped into Geoff completely changed my career moving forward.
>> And in fact you were a co-author on one of the very early papers on Restricted Boltzmann Machines that really helped with this resurgence of neural networks and deep learning. Tell me a bit more about what it was like working on that seminal paper.
>> Yeah, this was really exciting. It was my first year as a PhD student, and Geoff and I were trying to explore these ideas of using Restricted Boltzmann Machines and pre-training tricks to train multiple layers. Specifically, we were focusing on auto-encoders: how do we do a non-linear extension of PCA, effectively?
And it was very exciting, because we got these systems to work, but then the next step for us was to see whether we could extend these models to deal with faces. I remember we had this Olivetti faces dataset. And then we started looking at whether we could do compression for documents. We started looking at all these different kinds of data, real-valued, count, binary, and throughout that year, as a first-year PhD student, it was a big learning experience for me. But really, within six or seven months, we were able to get really good results. We were able to train these very deep auto-encoders, which is something you couldn't do at the time using traditional optimization techniques. It turned into a really exciting period for us. There was a lot of learning for me, but at the same time the results turned out to be really impressive for what we were trying to do.
>> So in the early days of that deep learning research, a lot of the activity was centered on Restricted Boltzmann Machines and then Deep Boltzmann Machines. There's still a lot of exciting research being done there, including some in your group, but what's happening now with Boltzmann Machines and Restricted Boltzmann Machines?
>> Yeah, that's a very good question. In the early days, the way we were using Restricted Boltzmann Machines, you can imagine training a stack of these Restricted Boltzmann Machines, which allows you to learn effectively one layer at a time. And there's good theory behind it: when you add a particular layer, you can show that it improves a variational bound, under certain conditions. So there was a theoretical justification, and these models were working quite well in terms of being able to pre-train these systems. And then around 2009, 2010, once the compute started showing up, the GPUs, a lot of us started realizing that directly optimizing these deep neural networks was giving similar or even better results.
>> So just standard backprop, without the pre-training or the Restricted Boltzmann Machine?
>> That's right, that's right.
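As an aside on the stack-of-RBMs pre-training Rus describes, here is a minimal sketch of greedy layer-wise pre-training with binary Restricted Boltzmann Machines trained by one step of contrastive divergence (CD-1). It is an illustration under simplifying assumptions, not the original implementation: the layer sizes, learning rate, epoch count, and the random toy data are all placeholders.

```python
# Minimal sketch of greedy layer-wise pre-training with binary RBMs and CD-1,
# in the spirit of the stack-of-RBMs approach described above. NumPy only;
# all hyperparameters and the toy data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0, lr=0.05):
        """One step of contrastive divergence (CD-1) on a batch v0."""
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)        # reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

def pretrain_stack(data, layer_sizes, epochs=10, batch=64):
    """Train one RBM per layer, feeding each layer's hidden probabilities
    upward as the 'data' for the next RBM."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            idx = rng.permutation(len(x))
            for start in range(0, len(x), batch):
                rbm.cd1_update(x[idx[start:start + batch]])
        x = rbm.hidden_probs(x)   # representation for the next layer
        rbms.append(rbm)
    return rbms

# Toy usage on random binary "data"; in the 2006-era recipe the learned
# weights would then initialize a deep network or deep auto-encoder that
# is fine-tuned end to end with backpropagation.
toy = (rng.random((512, 784)) < 0.1).astype(float)
stack = pretrain_stack(toy, layer_sizes=[256, 64])
```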
And that happened over three or four years, and it was exciting for the whole community, because people felt that, wow, you can actually train these deep models using these pre-training mechanisms. And then, with more compute, people started realizing that you can just do standard backpropagation, something that we couldn't do back in 2004 or 2005, because it would have taken us months on CPUs. And so that was a big change. The other thing is that we haven't really figured out what to do with Boltzmann Machines and Deep Boltzmann Machines. I believe they're very powerful models, because you can think of them as generative models: they're trying to model the joint distribution of the data. But when we look at the learning algorithms, right now they require Markov Chain Monte Carlo, variational learning and such, which is not as scalable as the backpropagation algorithm. So we have yet to figure out more efficient ways of training these models, and the use of convolution is also fairly difficult to integrate into them. I remember some of your work on using probabilistic max pooling to build these generative models of different objects; using these ideas of convolution was also very exciting, but at the same time it's still extremely hard to train these models.
>> Hard to get to work?
>> Yes, hard to get to work, right. And so we still have to figure that out. On the other hand, some of the recent work on variational auto-encoders, for example, which can be viewed as directed versions of Boltzmann Machines, is a case where we have figured out ways of training these models, the work by Max Welling and Diederik Kingma on using the reparameterization trick. And now we can use the backpropagation algorithm within a stochastic system, which is driving a lot of progress right now. But we haven't quite figured out how to do that in the case of Boltzmann Machines.
>> So that's actually a very interesting perspective I wasn't aware of: that in an earlier era, when computers were slower, RBM pre-training was really important, and it was only faster computation that drove the switch to standard backprop.
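As a brief aside on the reparameterization trick mentioned above, the sketch below shows the core idea: writing a sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1) makes the sample a differentiable function of mu and sigma, so gradients can flow through the sampling step. The toy objective E[z^2] and the specific numbers are illustrative assumptions, not taken from the interview or from the Kingma and Welling paper.

```python
# A small sketch of the reparameterization trick: the random sample becomes
# a deterministic, differentiable function of mu and sigma, so backprop can
# pass through it. Toy objective E[z^2]; its true gradient w.r.t. mu is 2*mu.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
n_samples = 200_000

eps = rng.standard_normal(n_samples)
z = mu + sigma * eps                 # reparameterized sample

# Pathwise (reparameterized) gradient estimate of d/dmu E[z^2]:
# d(z^2)/dmu = 2 * z * dz/dmu = 2 * z, since dz/dmu = 1.
grad_mu_estimate = np.mean(2.0 * z)

print(f"pathwise estimate: {grad_mu_estimate:.3f}, analytic: {2 * mu:.3f}")
# In a VAE the same idea lets the encoder's (mu, sigma) receive gradients
# from the decoder's reconstruction loss and the KL term.
```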
In terms of the evolution of the community's thinking in deep learning, I know you spend a lot of time thinking about the generative and unsupervised versus supervised approaches. Could you share a bit about how your thinking on that has evolved over time?
>> Yeah, I feel like it's a very important topic, particularly if we think about unsupervised, semi-supervised, or generative models, because to some extent a lot of the success we've seen recently is due to supervised learning. Back in the early days, unsupervised learning was primarily viewed as unsupervised pre-training, because we didn't know how to train these multi-layer systems. And even today, if you're working in settings where you have lots and lots of unlabeled data and a small fraction of labeled examples, these unsupervised pre-training models, building these generative models, can help with the supervised task. So that was kind of the belief for a lot of us in the community. When I started my PhD, it was all about generative models and trying to learn these stacks of models, because that was the only way for us to train these systems. Today there is a lot of work in generative modeling: Generative Adversarial Networks, variational auto-encoders, and deep energy models, which is something my lab is working on right now as well. I think it's very exciting research, but perhaps we haven't quite figured it out yet. So for many of you who are thinking about getting into the deep learning field, this is one area where I think we'll make a lot of progress, hopefully in the near future.
>> So, unsupervised learning.
>> Unsupervised learning, right. Or maybe you can think of it as unsupervised learning or semi-supervised learning, where I give you some hints or some examples of what different things mean, and then I throw lots and lots of unlabeled data at you.
>> So that was actually a very important insight: in an earlier era of deep learning, when computers were just slower, the Restricted Boltzmann Machine and Deep Boltzmann Machine were needed for initializing the neural network weights, but as computers got faster, straight backprop started to work much better.
So one other topic that I know you spend a lot of time thinking about is supervised learning versus generative models and unsupervised learning approaches. Tell me a bit about how your thinking on that debate has evolved over time.
>> I think we all believe that we should be able to make progress there. There's all the work on Boltzmann Machines, variational auto-encoders, GANs; you can think of a lot of these models as generative models, but we haven't quite figured out how to really make them work and how to make use of the large amounts of unlabeled data. Even in the IT sector, I see companies with lots and lots of unlabeled data putting a lot of effort into annotation, because that's the only way for us to make progress right now. It seems like we should be able to make use of the unlabeled data, because there's just an abundance of it, and we haven't quite figured out how to do that yet.
>> So you mentioned that for people wanting to enter deep learning research, unsupervised learning is an exciting area. Today there are a lot of people wanting to enter deep learning, either research or applied work, so for this global community, what advice would you have?
>> Yes, I think one of the key pieces of advice I would give to people entering the field is to just try different things, to not be afraid to try new things, and to not be afraid to innovate. I can give you one example. When I was a graduate student, we were looking at neural nets, and these are highly non-convex systems that are hard to optimize. I remember talking to my friends in the optimization community, and the feedback was always, well, there's no way you can solve these problems, because they're non-convex, we don't understand the optimization, how could you ever even do that, compared to doing convex optimization? And it was surprising, because in our lab we never really cared that much about those specific concerns. We were thinking about how we could optimize and whether we could get interesting results.
And that, effectively, was what was driving the community. We weren't scared, maybe to some extent because we lacked the theory behind the optimization. But I would encourage people to just try, and not be afraid to tackle hard problems.
>> Yeah, and I remember you once said, don't just learn to code in the high-level deep learning frameworks, but actually understand deep learning.
>> Yes, that's right. It's one of the things I try to do when I teach a deep learning class: in one of the homeworks, I ask people to actually code the backpropagation algorithm for convolutional neural networks. It's painful, but at the same time, if you do it once, you'll really understand how these systems operate and how you can implement them efficiently on GPUs. I think it's important that when you go into research or industry, you have a really good understanding of what these systems are doing. So it's important, I think.
>> Since you have both academic experience as a professor and corporate experience, I'm curious: if someone wants to enter deep learning, what are the pros and cons of doing a PhD versus joining a company?
>> Yeah, I think that's actually a very good question. In my particular lab, I have a mix of students. Some students want to take an academic route, and some students want to take an industry route. And it's becoming a very challenging choice, because you can do amazing research in industry, and you can also do amazing research in academia. But in terms of pros and cons, in academia I feel like you have more freedom to work on long-term problems, or if you think about some crazy problem, you can work on it, so you have a little bit more freedom. At the same time, the research you're doing in industry is also very exciting, because in many cases your research can impact millions of users if you develop a core AI technology. And obviously, within industry you have many more resources in terms of compute and are able to do really amazing things. So there are pluses and minuses; it really depends on what you want to do.
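As an aside in the spirit of the homework Rus mentions above, coding backpropagation by hand, here is a minimal single-channel 2D convolution with a hand-derived backward pass, checked against a finite-difference estimate. It is a simplified sketch, not the course assignment: the toy loss, array sizes, and function names are assumptions made for illustration.

```python
# A minimal hand-written conv layer: forward pass, backward pass, and a
# finite-difference gradient check on a toy loss L = 0.5 * sum(conv(x, w)**2).
import numpy as np

rng = np.random.default_rng(0)

def conv2d_forward(x, w):
    """Single-channel 'valid' convolution (really cross-correlation), stride 1."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

def conv2d_backward(x, w, d_out):
    """Given dL/d_out, return dL/dx and dL/dw."""
    kH, kW = w.shape
    dx, dw = np.zeros_like(x), np.zeros_like(w)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            dw += d_out[i, j] * x[i:i + kH, j:j + kW]
            dx[i:i + kH, j:j + kW] += d_out[i, j] * w
    return dx, dw

# Gradient check: for L = 0.5 * sum(y**2) with y = conv(x, w), dL/dy = y.
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
y = conv2d_forward(x, w)
dx, dw = conv2d_backward(x, w, d_out=y)

eps = 1e-5
w_pert = w.copy()
w_pert[0, 0] += eps
numeric = (0.5 * np.sum(conv2d_forward(x, w_pert) ** 2)
           - 0.5 * np.sum(y ** 2)) / eps
print(f"analytic dL/dw[0,0] = {dw[0, 0]:.5f}, numeric = {numeric:.5f}")
```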
And right now it's a very interesting environment, where academics move to industry, and folks from industry move to academia, but not as much. So it's a very exciting time.
>> It sounds like academic machine learning is great and corporate machine learning is great, and the most important thing is to just jump in, right? Either one, just jump in.
>> It really depends on your preferences, because you can do amazing research in either place.
>> So you've mentioned unsupervised learning as one exciting frontier for research. Are there other areas that you consider exciting frontiers for research?
>> Yeah, absolutely. I think what I see in the community right now, particularly the deep learning community, is a few trends. One particular area I think is really exciting is deep reinforcement learning, because we've been able to figure out how to train agents in virtual worlds, and in just the last couple of years you see a lot of progress in how we can scale these systems, how we can develop new algorithms, and how we can get agents to communicate with each other. In general, settings where you're interacting with the environment are super exciting. The other area I think is really exciting is reasoning and natural language understanding. Can we build dialogue-based systems? Can we build systems that can reason, that can read text and answer questions intelligently? This is something that a lot of research is focusing on right now. And then there's another sub-area: being able to learn from few examples. Typically people think of it as one-shot learning or transfer learning, a setting where you learn something about the world, then I throw a new task at you, and you can solve it very quickly, much like humans do, without requiring lots and lots of labeled examples. A lot of us in the community are trying to figure out how we can do that, and how we can come closer to human-like learning abilities.
>> Thank you, Rus, for sharing all the comments and insights.
It was interesting hearing the story of your early days doing this as well.
>> [LAUGH] Thanks, Andrew, yeah. Thanks for having me.