1 00:00:02,420 --> 00:00:04,575 So, thanks a lot, Pieter, 2 00:00:04,575 --> 00:00:06,690 for joining me today. 3 00:00:06,690 --> 00:00:08,560 I think a lot of people know you as 4 00:00:08,560 --> 00:00:12,150 a well-known machine learning and deep learning and robotics researcher. 5 00:00:12,150 --> 00:00:15,550 I'd like to have people hear a bit about your story. 6 00:00:15,550 --> 00:00:18,220 How did you end up doing the work that you do? 7 00:00:18,220 --> 00:00:22,300 That's a good question, and actually if you had asked me as a 14-year-old 8 00:00:22,300 --> 00:00:24,775 what I was aspiring to do, 9 00:00:24,775 --> 00:00:26,775 it probably would not have been this. 10 00:00:26,775 --> 00:00:28,285 In fact, at the time, 11 00:00:28,285 --> 00:00:32,565 I thought being a professional basketball player would be the right way to go. 12 00:00:32,565 --> 00:00:34,680 I don't think I was able to achieve it. 13 00:00:34,680 --> 00:00:36,430 I feel like machine learning lucked out, 14 00:00:36,430 --> 00:00:38,250 that the basketball thing didn't work out. 15 00:00:38,250 --> 00:00:39,510 Yes, that didn't work out. 16 00:00:39,510 --> 00:00:41,890 It was a lot of fun playing basketball but it didn't work 17 00:00:41,890 --> 00:00:44,885 out to try to make it into a career. 18 00:00:44,885 --> 00:00:48,530 So, what I really liked in school was physics and math. 19 00:00:48,530 --> 00:00:50,005 And so, from there, 20 00:00:50,005 --> 00:00:52,120 it seemed pretty natural to study engineering, which 21 00:00:52,120 --> 00:00:55,735 is applying physics and math in the real world. 22 00:00:55,735 --> 00:00:58,150 And actually then, after my undergrad in electrical engineering, 23 00:00:58,150 --> 00:01:00,355 I actually wasn't so sure what to do because, 24 00:01:00,355 --> 00:01:03,981 literally, anything engineering seemed interesting to me. 25 00:01:03,981 --> 00:01:07,680 Understanding how anything works seems interesting.
26 00:01:07,680 --> 00:01:09,595 Trying to build anything is interesting. 27 00:01:09,595 --> 00:01:11,470 And in some sense, 28 00:01:11,470 --> 00:01:13,690 artificial intelligence won out because it seemed like it 29 00:01:13,690 --> 00:01:18,280 could somehow help all disciplines in some way. 30 00:01:18,280 --> 00:01:22,370 And also, it seemed somehow a little more at the core of everything. 31 00:01:22,370 --> 00:01:24,575 You think about how a machine can think, 32 00:01:24,575 --> 00:01:30,160 then maybe that's more the core of everything else than picking any specific discipline. 33 00:01:30,160 --> 00:01:33,260 I've been saying AI is the new electricity; 34 00:01:33,260 --> 00:01:35,020 sounds like the 14-year-old version of you 35 00:01:35,020 --> 00:01:37,923 had an earlier version of that even. 36 00:01:37,923 --> 00:01:44,465 You know, in the past few years you've done a lot of work in deep reinforcement learning. 37 00:01:44,465 --> 00:01:49,315 What's happening? Why is deep reinforcement learning suddenly taking off? 38 00:01:49,315 --> 00:01:51,030 Before I worked in deep reinforcement learning, 39 00:01:51,030 --> 00:01:52,765 I worked a lot in reinforcement learning; 40 00:01:52,765 --> 00:01:56,115 actually with you and Durant at Stanford, of course. 41 00:01:56,115 --> 00:01:59,863 And so, we worked on autonomous helicopter flight, 42 00:01:59,863 --> 00:02:02,440 then later at Berkeley with some of my students who worked 43 00:02:02,440 --> 00:02:05,440 on getting a robot to learn to fold laundry. 44 00:02:05,440 --> 00:02:09,340 And kind of what characterized the work was a combination 45 00:02:09,340 --> 00:02:13,015 of learning that enabled things that would not be possible without learning, 46 00:02:13,015 --> 00:02:18,120 but also a lot of domain expertise in combination with the learning to get this to work.
47 00:02:18,120 --> 00:02:20,975 And it was very 48 00:02:20,975 --> 00:02:22,600 interesting because you needed domain expertise, which 49 00:02:22,600 --> 00:02:24,310 was fun to acquire but, at the same time, 50 00:02:24,310 --> 00:02:28,234 was very time-consuming; for every new application you wanted to succeed at, 51 00:02:28,234 --> 00:02:31,060 you needed domain expertise plus machine learning expertise. 52 00:02:31,060 --> 00:02:34,240 And for me it was in 2012, with 53 00:02:34,240 --> 00:02:39,910 the ImageNet breakthrough results from Geoff Hinton's group in Toronto, 54 00:02:39,910 --> 00:02:42,880 AlexNet showing that supervised learning, all of a sudden, 55 00:02:42,880 --> 00:02:48,220 could be done with far less engineering for the domain at hand. 56 00:02:48,220 --> 00:02:50,410 There was very little vision-specific engineering in AlexNet. 57 00:02:50,410 --> 00:02:53,075 It made me think we really should revisit 58 00:02:53,075 --> 00:02:57,610 reinforcement learning under the same kind of viewpoint and see if we can 59 00:02:57,610 --> 00:03:01,075 get the deep version of reinforcement learning to work and do 60 00:03:01,075 --> 00:03:05,950 equally interesting things as had just happened in supervised learning. 61 00:03:05,950 --> 00:03:08,565 It sounds like you saw earlier than 62 00:03:08,565 --> 00:03:12,250 most people the potential of deep reinforcement learning. 63 00:03:12,250 --> 00:03:14,365 So now, looking into the future, 64 00:03:14,365 --> 00:03:16,180 what do you see next? 65 00:03:16,180 --> 00:03:17,260 What are your predictions for the 66 00:03:17,260 --> 00:03:20,440 next several years to come in deep reinforcement learning? 67 00:03:20,440 --> 00:03:23,270 So, I think what's interesting about deep reinforcement learning is that, 68 00:03:23,270 --> 00:03:26,795 in some sense, there are many more questions than in supervised learning. 69 00:03:26,795 --> 00:03:29,817 In supervised learning, it's about learning an input-output mapping.
70 00:03:29,817 --> 00:03:34,505 In reinforcement learning there is the notion of: Where does the data even come from? 71 00:03:34,505 --> 00:03:36,580 So that's the exploration problem. 72 00:03:36,580 --> 00:03:38,470 When you have data, how do you do credit assignment? 73 00:03:38,470 --> 00:03:43,315 How do you understand which actions you took early on got you the reward later? 74 00:03:43,315 --> 00:03:44,830 And then, there are issues of safety. 75 00:03:44,830 --> 00:03:47,335 When you have a system autonomously collecting data, 76 00:03:47,335 --> 00:03:50,140 it's actually rather dangerous in most situations. 77 00:03:50,140 --> 00:03:51,880 Imagine a self-driving car company that says, 78 00:03:51,880 --> 00:03:53,825 we're just going to run deep reinforcement learning. 79 00:03:53,825 --> 00:03:55,690 It's pretty likely that car would get into a lot of 80 00:03:55,690 --> 00:03:57,985 accidents before it does anything useful. 81 00:03:57,985 --> 00:03:59,650 You need negative examples of that, right? 82 00:03:59,650 --> 00:04:02,000 You do need some negative examples somehow, yes; 83 00:04:02,000 --> 00:04:04,930 and positive ones, hopefully. 84 00:04:04,930 --> 00:04:07,540 So, I think there are still a lot of challenges in 85 00:04:07,540 --> 00:04:09,760 deep reinforcement learning in terms of 86 00:04:09,760 --> 00:04:12,635 working out some of the specifics of how to get these things to work. 87 00:04:12,635 --> 00:04:14,520 So, the deep part is the representation, 88 00:04:14,520 --> 00:04:18,455 but then the reinforcement learning itself still has a lot of questions. 89 00:04:18,455 --> 00:04:20,485 And what I feel is that, 90 00:04:20,485 --> 00:04:22,810 with the advances in deep learning, 91 00:04:22,810 --> 00:04:27,430 somehow one part of the puzzle in reinforcement learning has been largely addressed, 92 00:04:27,430 --> 00:04:29,075 which is the representation part.
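[Editor's note] The credit-assignment question described above is commonly made concrete with discounted returns: a reward earned late in an episode is credited back to earlier actions with exponentially decaying weight. A minimal illustrative sketch follows; the episode rewards and discount factor are made-up values, not anything stated in the interview:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every timestep t:

        G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    G_t is the learning signal credited to the action taken at t,
    which is one simple answer to the credit-assignment question.
    """
    returns = []
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Toy episode: no reward until the very last step. The final reward is
# credited, with discounting, back to the two earlier actions.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))
```

With gamma below 1, actions far from the eventual reward receive exponentially less credit, which is one way to see why reasoning over long time horizons stays hard.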
93 00:04:29,075 --> 00:04:31,540 So, if there is a pattern we can 94 00:04:31,540 --> 00:04:34,795 probably represent it with a deep network and capture that pattern. 95 00:04:34,795 --> 00:04:39,400 And how to tease apart the pattern is still a big challenge in reinforcement learning. 96 00:04:39,400 --> 00:04:41,740 So I think big challenges are 97 00:04:41,740 --> 00:04:45,695 how to get systems to reason over long time horizons. 98 00:04:45,695 --> 00:04:47,770 So right now, a lot of the successes 99 00:04:47,770 --> 00:04:50,650 in deep reinforcement learning are on a very short horizon. 100 00:04:50,650 --> 00:04:52,000 There are problems where, 101 00:04:52,000 --> 00:04:54,445 if you act well over a five-second horizon, 102 00:04:54,445 --> 00:04:57,815 you act well over the entire problem. 103 00:04:57,815 --> 00:05:02,599 And so a five-second scale is something very different from a day-long scale, 104 00:05:02,599 --> 00:05:06,930 or the ability to live a life as a robot or some software agent. 105 00:05:06,930 --> 00:05:09,240 So, I think there's a lot of challenges there. 106 00:05:09,240 --> 00:05:12,790 I think safety has a lot of challenges in terms of, 107 00:05:12,790 --> 00:05:14,920 how do you learn safely and also how do 108 00:05:14,920 --> 00:05:17,785 you keep learning once you're already pretty good? 109 00:05:17,785 --> 00:05:20,305 So, to give an example again that 110 00:05:20,305 --> 00:05:23,070 a lot of people would be familiar with, self-driving cars: 111 00:05:23,070 --> 00:05:26,375 for a self-driving car to be better than a human driver, 112 00:05:26,375 --> 00:05:31,990 well, human drivers maybe get into bad accidents only every three million miles or something. 113 00:05:31,990 --> 00:05:35,763 And so, it takes a long time to see the negative data, 114 00:05:35,763 --> 00:05:37,510 once you're as good as a human driver. 115 00:05:37,510 --> 00:05:40,835 But you want your self-driving car to be better than a human driver.
116 00:05:40,835 --> 00:05:43,930 And so, at that point the data collection becomes really, really difficult: to get 117 00:05:43,930 --> 00:05:48,175 that interesting data that makes your system improve. 118 00:05:48,175 --> 00:05:52,420 So, there are a lot of challenges related to exploration that tie into that. 119 00:05:52,420 --> 00:05:57,190 But one of the things I'm actually most excited about right now is seeing 120 00:05:57,190 --> 00:06:02,720 if we can actually take a step back and also learn the reinforcement learning algorithm. 121 00:06:02,720 --> 00:06:05,030 So, reinforcement learning is very complex, 122 00:06:05,030 --> 00:06:07,450 credit assignment is very complex, exploration is very complex. 123 00:06:07,450 --> 00:06:08,905 And so maybe, just like 124 00:06:08,905 --> 00:06:13,795 how deep learning for supervised learning was able to replace a lot of domain expertise, 125 00:06:13,795 --> 00:06:17,320 maybe we can have programs that are learned, 126 00:06:17,320 --> 00:06:20,140 that are reinforcement learning programs that do all this, 127 00:06:20,140 --> 00:06:22,510 instead of us designing the details. 128 00:06:22,510 --> 00:06:25,560 Learning the reward function or learning the whole program? 129 00:06:25,560 --> 00:06:28,150 So, this would be learning the entire reinforcement learning program. 130 00:06:28,150 --> 00:06:30,430 So, it would be, imagine, 131 00:06:30,430 --> 00:06:34,255 you have a reinforcement learning program, whatever it is, 132 00:06:34,255 --> 00:06:38,320 and you throw it at some problem and then you see how long it takes to learn. 133 00:06:38,320 --> 00:06:41,020 And then you say, well, that took a while. 134 00:06:41,020 --> 00:06:44,950 Now, let another program modify this reinforcement learning program. 135 00:06:44,950 --> 00:06:48,045 After the modification, see how fast it learns.
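[Editor's note] The loop being described here, let an outer program modify the reinforcement learning program, measure how fast the modified version learns, and keep changes that help, is essentially hill-climbing over the learning algorithm itself. A deliberately toy sketch follows, in which the whole "RL program" is reduced to a single hyperparameter and `learning_speed` is a stand-in score function; every name and number is illustrative rather than any real system:

```python
import random

def learning_speed(algo_params):
    """Stand-in for 'run the RL program on a task and time how long it
    takes to learn'. Here it is a made-up score that peaks at lr = 0.1;
    higher means the program learned faster."""
    return -abs(algo_params["lr"] - 0.1)

def meta_optimize(steps=200, seed=0):
    """Hill-climb over the RL program itself: propose a random
    modification and keep it whenever the modified program learns
    faster than the current one."""
    rng = random.Random(seed)
    params = {"lr": 1.0}              # the initial, badly-tuned RL program
    best = learning_speed(params)
    for _ in range(steps):
        candidate = {"lr": params["lr"] + rng.gauss(0.0, 0.05)}
        score = learning_speed(candidate)
        if score > best:              # learned more quickly: keep it
            params, best = candidate, score
    return params

print(meta_optimize())
```

In real "learning to learn" work the single hyperparameter would be the parameters of a whole learned update rule and the score would come from full reinforcement learning runs, which is exactly the compute-hungry inner-loop setup the conversation turns to next.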
136 00:06:48,045 --> 00:06:49,641 If it learns more quickly, 137 00:06:49,641 --> 00:06:54,380 that was a good modification and maybe keep it and improve from there. 138 00:06:54,380 --> 00:06:57,630 Well, I see, right. Yes, and that's the direction. 139 00:06:57,630 --> 00:06:59,290 I think it has a lot to do with, maybe, 140 00:06:59,290 --> 00:07:01,510 the amount of compute that's becoming available. 141 00:07:01,510 --> 00:07:05,860 So, this would be running reinforcement learning in the inner loop. 142 00:07:05,860 --> 00:07:08,975 For us right now, we run reinforcement learning as the final thing. 143 00:07:08,975 --> 00:07:11,260 And so, the more compute we get, 144 00:07:11,260 --> 00:07:14,545 the more it becomes possible to maybe run something 145 00:07:14,545 --> 00:07:19,160 like reinforcement learning in the inner loop of a bigger algorithm. 146 00:07:19,160 --> 00:07:22,080 Starting from the 14-year-old, 147 00:07:22,080 --> 00:07:25,355 you've worked in AI for some 20-plus years now. 148 00:07:25,355 --> 00:07:32,795 So, tell me a bit about how your understanding of AI has evolved over this time. 149 00:07:32,795 --> 00:07:35,280 When I started looking at AI, 150 00:07:35,280 --> 00:07:38,230 it's very interesting because it really 151 00:07:38,230 --> 00:07:41,445 coincided with coming to Stanford to do my master's degree there, 152 00:07:41,445 --> 00:07:46,998 and there were some icons there like John McCarthy, whom I got to talk with, 153 00:07:46,998 --> 00:07:49,300 but who had a very different approach, 154 00:07:49,300 --> 00:07:50,460 in the year 2000, 155 00:07:50,460 --> 00:07:52,115 from what most people were doing at the time. 156 00:07:52,115 --> 00:07:54,958 And also talking with Daphne Koller. 157 00:07:54,958 --> 00:07:59,320 And I think a lot of my initial thinking of AI was shaped by Daphne's thinking.
158 00:07:59,320 --> 00:08:04,300 Her AI class, her probabilistic graphical models class, 159 00:08:04,300 --> 00:08:06,820 and kind of really being intrigued by 160 00:08:06,820 --> 00:08:11,450 how simply a distribution over many random variables, and then being able to condition 161 00:08:11,450 --> 00:08:14,950 on some subset of variables and draw conclusions about others, could 162 00:08:14,950 --> 00:08:19,015 actually give you so much if you can somehow make it computationally tractable, 163 00:08:19,015 --> 00:08:23,170 which was definitely the challenge. 164 00:08:23,170 --> 00:08:25,090 And then from there, 165 00:08:25,090 --> 00:08:28,335 when I started my Ph.D. and you arrived at Stanford, 166 00:08:28,335 --> 00:08:30,910 I think you gave me a really good reality check, 167 00:08:30,910 --> 00:08:35,350 that that's not the right metric to evaluate your work by, 168 00:08:35,350 --> 00:08:38,470 and to really try to see the connection from what 169 00:08:38,470 --> 00:08:41,710 you're working on to what impact it can really have, 170 00:08:41,710 --> 00:08:46,660 what change it can make, rather than what's the math that happened to be in your work. 171 00:08:46,660 --> 00:08:48,425 Right. That's amazing. 172 00:08:48,425 --> 00:08:50,685 I did not realize, I'd forgotten that. 173 00:08:50,685 --> 00:08:54,267 Yes, it's actually the thing I cite most often when people ask: 174 00:08:54,267 --> 00:09:01,090 if you were to cite only one thing that has stuck with you from Andrew's advice, 175 00:09:01,090 --> 00:09:05,995 it's making sure you can see the connection to where it's actually going to do something. 176 00:09:05,995 --> 00:09:11,332 You've had, and you're continuing to have, an amazing career in AI.
177 00:09:11,332 --> 00:09:14,750 So, for some of the people listening to you on video now, 178 00:09:14,750 --> 00:09:18,815 if they want to also enter or pursue a career in AI, 179 00:09:18,815 --> 00:09:20,985 what advice do you have for them? 180 00:09:20,985 --> 00:09:25,185 I think it's a really good time to get into artificial intelligence. 181 00:09:25,185 --> 00:09:28,965 If you look at the demand for people, it's so high, 182 00:09:28,965 --> 00:09:30,741 there are so many job opportunities, 183 00:09:30,741 --> 00:09:32,365 so many things you can do, research-wise, 184 00:09:32,365 --> 00:09:34,735 building new companies and so forth. 185 00:09:34,735 --> 00:09:39,240 So, I'd say yes, it's definitely a smart decision in terms of actually getting going. 186 00:09:39,240 --> 00:09:41,140 A lot of it, you can self-study, 187 00:09:41,140 --> 00:09:42,635 whether you're in school or not. 188 00:09:42,635 --> 00:09:44,150 There are a lot of online courses; for instance, 189 00:09:44,150 --> 00:09:45,585 your machine learning course. 190 00:09:45,585 --> 00:09:48,400 There is also, for example, 191 00:09:48,400 --> 00:09:52,030 Andrej Karpathy's deep learning course, which has videos online, 192 00:09:52,030 --> 00:09:54,280 which is a great way to get started. 193 00:09:54,280 --> 00:09:57,460 Berkeley has a deep reinforcement learning course 194 00:09:57,460 --> 00:09:59,260 which has all of the lectures online. 195 00:09:59,260 --> 00:10:01,235 So, those are all good places to get started. 196 00:10:01,235 --> 00:10:06,470 I think a big part of what's important is to make sure you try things yourself. 197 00:10:06,470 --> 00:10:10,055 So, not just read things or watch videos, but try things out.
198 00:10:10,055 --> 00:10:14,347 With frameworks like TensorFlow, 199 00:10:14,347 --> 00:10:16,040 Chainer, Theano, PyTorch and so forth, 200 00:10:16,040 --> 00:10:17,350 I mean, whatever is your favorite, 201 00:10:17,350 --> 00:10:21,980 it's very easy to get going and get something up and running very quickly. 202 00:10:21,980 --> 00:10:24,669 To get the practice yourself, right? 203 00:10:24,669 --> 00:10:27,105 With implementing and seeing what does and what doesn't work. 204 00:10:27,105 --> 00:10:29,360 So, this past week there was an article in 205 00:10:29,360 --> 00:10:31,715 Mashable about a 16-year-old in the United Kingdom 206 00:10:31,715 --> 00:10:34,580 who is one of the leaders in Kaggle competitions. 207 00:10:34,580 --> 00:10:36,690 And it just said 208 00:10:36,690 --> 00:10:39,290 he just went out and learned things, 209 00:10:39,290 --> 00:10:41,510 found things online, learned everything himself and 210 00:10:41,510 --> 00:10:44,915 never actually took any formal course per se. 211 00:10:44,915 --> 00:10:49,180 And here is a 16-year-old being very competitive in Kaggle competitions, 212 00:10:49,180 --> 00:10:50,990 so it's definitely possible. 213 00:10:50,990 --> 00:10:53,120 We live in good times. 214 00:10:53,120 --> 00:10:54,560 If people want to learn. 215 00:10:54,560 --> 00:10:55,940 Absolutely. 216 00:10:55,940 --> 00:10:57,980 One question I bet you get asked sometimes 217 00:10:57,980 --> 00:11:00,160 is, if someone wants to enter AI, machine learning, and deep learning, 218 00:11:00,160 --> 00:11:06,885 should they apply to a Ph.D. program or should they get a job with a big company? 219 00:11:06,885 --> 00:11:12,395 I think a lot of it has to do with maybe how much mentoring you can get. 220 00:11:12,395 --> 00:11:14,780 So, in a Ph.D.
program, 221 00:11:14,780 --> 00:11:16,400 you're essentially guaranteed: 222 00:11:16,400 --> 00:11:17,787 the job of the professor, 223 00:11:17,787 --> 00:11:18,830 who is your adviser, 224 00:11:18,830 --> 00:11:20,800 is to look out for you. 225 00:11:20,800 --> 00:11:21,950 They'll try to do everything they can to, 226 00:11:21,950 --> 00:11:23,565 kind of, shape you, 227 00:11:23,565 --> 00:11:28,720 help you become stronger at whatever you want to do, for example, AI. 228 00:11:28,720 --> 00:11:32,060 And so, there is a very clear dedicated person; sometimes you have two advisers. 229 00:11:32,060 --> 00:11:34,955 And that's literally their job, and that's why they are professors; 230 00:11:34,955 --> 00:11:37,755 most of what they like about being professors often is helping 231 00:11:37,755 --> 00:11:41,200 shape students to become more capable at things. 232 00:11:41,200 --> 00:11:43,250 Now, it doesn't mean it's not possible at companies, 233 00:11:43,250 --> 00:11:46,730 and many companies have really good mentors and have people who love 234 00:11:46,730 --> 00:11:51,110 to help educate people who come in and strengthen them, and so forth. 235 00:11:51,110 --> 00:11:55,515 It's just, it might not be as much of a guarantee and a given, 236 00:11:55,515 --> 00:12:00,540 compared to actually enrolling in a Ph.D. program, where the crux of 237 00:12:00,540 --> 00:12:06,020 the program is that you're going to learn and somebody is there to help you learn. 238 00:12:06,020 --> 00:12:09,675 So it really depends on the company and depends on the Ph.D. program. 239 00:12:09,675 --> 00:12:14,130 Absolutely, yes. But I think it is key that you can learn a lot on your own. 240 00:12:14,130 --> 00:12:17,910 But I think you can learn a lot faster if you have somebody who's more experienced, 241 00:12:17,910 --> 00:12:20,469 who is actually taking it up as 242 00:12:20,469 --> 00:12:24,945 their responsibility to spend time with you and help accelerate your progress.
243 00:12:24,945 --> 00:12:28,780 So, you've been one of the most visible leaders in deep reinforcement learning. 244 00:12:28,780 --> 00:12:30,720 So, what are the things that 245 00:12:30,720 --> 00:12:32,930 deep reinforcement learning is already working really well at? 246 00:12:32,930 --> 00:12:37,450 I think, if you look at some deep reinforcement learning successes, 247 00:12:37,450 --> 00:12:39,000 it's very, very intriguing. 248 00:12:39,000 --> 00:12:42,810 For example, learning to play Atari games from pixels, 249 00:12:42,810 --> 00:12:45,540 processing these pixels, which are just numbers that are being 250 00:12:45,540 --> 00:12:49,150 processed somehow and turned into joystick actions. 251 00:12:49,150 --> 00:12:52,605 Then, for example, some of the work we did at Berkeley was, 252 00:12:52,605 --> 00:12:57,105 we have a simulated robot inventing walking, and the reward 253 00:12:57,105 --> 00:12:59,340 that it's given is as simple as: the further you go north the 254 00:12:59,340 --> 00:13:02,170 better, and the less hard you impact with the ground the better. 255 00:13:02,170 --> 00:13:06,949 And somehow it decides that walking slash running is the thing to invent, whereas 256 00:13:06,949 --> 00:13:10,095 nobody showed it what walking is or running is. 257 00:13:10,095 --> 00:13:14,220 Or a robot playing with children's toys and learning to kind of put them together, 258 00:13:14,220 --> 00:13:16,935 put a block into a matching opening, and so forth. 259 00:13:16,935 --> 00:13:20,280 And so, I think it's really interesting that in all of these it's possible to learn 260 00:13:20,280 --> 00:13:24,510 from raw sensory inputs all the way to raw controls, 261 00:13:24,510 --> 00:13:27,990 for example, torques at the motors. 262 00:13:27,990 --> 00:13:29,225 But at the same time, 263 00:13:29,225 --> 00:13:32,460 it is very interesting that you can have a single algorithm.
264 00:13:32,460 --> 00:13:35,310 For example, you know, trust region policy optimization: you can 265 00:13:35,310 --> 00:13:36,745 have a robot learn to run, 266 00:13:36,745 --> 00:13:38,135 can have a robot learn to stand up, 267 00:13:38,135 --> 00:13:40,395 can have, instead of a two-legged robot, 268 00:13:40,395 --> 00:13:42,445 now you swap in a four-legged robot. 269 00:13:42,445 --> 00:13:46,465 You run the same reinforcement learning algorithm and it still learns to run. 270 00:13:46,465 --> 00:13:49,280 And so, there is no change in the reinforcement learning algorithm. 271 00:13:49,280 --> 00:13:51,615 It's very, very general. Same for the Atari games. 272 00:13:51,615 --> 00:13:54,565 DQN was the same DQN for every one of the games. 273 00:13:54,565 --> 00:13:56,640 But then, when it actually starts hitting 274 00:13:56,640 --> 00:14:00,060 the frontiers of what's not yet possible as well, 275 00:14:00,060 --> 00:14:03,490 it's nice that it learns from scratch for each one of 276 00:14:03,490 --> 00:14:07,405 these tasks, but it would be even nicer if it could reuse things it's learned in the past 277 00:14:07,405 --> 00:14:09,640 to learn even more quickly for the next task. 278 00:14:09,640 --> 00:14:13,100 And that's something that's still on the frontier and not yet possible. 279 00:14:13,100 --> 00:14:16,490 It always starts from scratch, essentially. 280 00:14:16,490 --> 00:14:19,390 How quickly do you think we'll see deep 281 00:14:19,390 --> 00:14:22,420 reinforcement learning deployed in the robots around us, 282 00:14:22,420 --> 00:14:25,935 the robots that are getting deployed in the world today? 283 00:14:25,935 --> 00:14:29,380 I think in practice the realistic scenario is one 284 00:14:29,380 --> 00:14:32,770 where it starts with supervised learning, 285 00:14:32,770 --> 00:14:35,960 behavioral cloning; humans do the work.
286 00:14:35,960 --> 00:14:38,530 And I think a lot of businesses will be built 287 00:14:38,530 --> 00:14:41,790 that way, where it's a human behind the scenes doing a lot of the work. 288 00:14:41,790 --> 00:14:44,980 Imagine a Facebook Messenger assistant. 289 00:14:44,980 --> 00:14:47,980 An assistant like that could be built with a human behind 290 00:14:47,980 --> 00:14:51,310 the curtains doing a lot of the work; machine learning 291 00:14:51,310 --> 00:14:54,380 matches up with what the human does and starts making suggestions to the 292 00:14:54,380 --> 00:14:58,130 human, so the human has a small number of options they can just click and select. 293 00:14:58,130 --> 00:14:59,640 And then over time, 294 00:14:59,640 --> 00:15:01,130 as it gets pretty good, 295 00:15:01,130 --> 00:15:04,465 you start infusing some reinforcement learning, where you give it actual objectives, 296 00:15:04,465 --> 00:15:06,565 not just matching the human behind the curtains 297 00:15:06,565 --> 00:15:09,040 but giving objectives of achievement like, 298 00:15:09,040 --> 00:15:14,110 maybe, how fast were these two people able to plan their meeting? 299 00:15:14,110 --> 00:15:16,385 Or how fast were they able to book their flight? 300 00:15:16,385 --> 00:15:18,340 Or things like that. How long did it take? 301 00:15:18,340 --> 00:15:20,065 How happy were they with it? 302 00:15:20,065 --> 00:15:22,815 But it would probably have to be bootstrapped off a lot of 303 00:15:22,815 --> 00:15:27,605 behavioral cloning of humans showing how this could be done. 304 00:15:27,605 --> 00:15:30,690 So it sounds like behavioral cloning is just supervised learning to 305 00:15:30,690 --> 00:15:33,580 mimic whatever the person is doing, and then gradually, later on, 306 00:15:33,580 --> 00:15:37,434 reinforcement learning to have it think about longer time horizons? 307 00:15:37,434 --> 00:15:38,500 Is that a fair summary? 308 00:15:38,500 --> 00:15:39,715 I'd say so, yes.
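[Editor's note] The recipe just summarized, imitate first with supervised learning and then fine-tune against real objectives, can be sketched end to end. This is a schematic under heavy assumptions: the demo log, the states and actions, and the reward function below are all hypothetical placeholders, and the "RL" stage is reduced to a deterministic policy-improvement sweep rather than a real algorithm:

```python
# --- Stage 1: behavioral cloning (supervised learning on human demos) ---
# Hypothetical log of (state, action) pairs from the human behind the curtain.
demos = [("s0", "a1"), ("s0", "a1"), ("s0", "a0"), ("s1", "a0")]

ACTIONS = ["a0", "a1"]

def clone_policy(demos):
    """Simplest possible imitation: for each state, pick the action the
    human chose most often."""
    counts = {}
    for s, a in demos:
        counts.setdefault(s, {}).setdefault(a, 0)
        counts[s][a] += 1
    return {s: max(acts, key=acts.get) for s, acts in counts.items()}

# --- Stage 2: fine-tune against an actual objective ---
# Hypothetical reward, e.g. "how fast did the meeting get planned".
def reward(s, a):
    return 1.0 if (s, a) == ("s0", "a0") else 0.0

def fine_tune(policy):
    """Deterministic stand-in for the RL stage: in each state, switch to
    any action that scores better than the imitated one, so the objective
    overrides pure imitation where the two disagree."""
    for s in policy:
        for a in ACTIONS:
            if reward(s, a) > reward(s, policy[s]):
                policy[s] = a
    return policy

policy = clone_policy(demos)   # mimics the human: {'s0': 'a1', 's1': 'a0'}
policy = fine_tune(policy)     # the reward flips s0 from 'a1' to 'a0'
print(policy)
```

In practice stage 1 would be a neural network trained by supervised learning and stage 2 a policy-gradient-style update, but the shape of the pipeline, imitate, then optimize the true objective, is the same.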
309 00:15:39,715 --> 00:15:43,540 Just because straight-up reinforcement learning from scratch is really fun to watch. 310 00:15:43,540 --> 00:15:46,780 It's super intriguing; there are very few things more fun to watch 311 00:15:46,780 --> 00:15:50,440 than a reinforcement learning robot starting from nothing and inventing things. 312 00:15:50,440 --> 00:15:54,280 But it's just time-consuming and it's not always safe. 313 00:15:54,280 --> 00:15:56,200 Thank you very much. That was fascinating. 314 00:15:56,200 --> 00:15:58,005 I'm really glad we had the chance to chat. 315 00:15:58,005 --> 00:16:02,670 Well, Andrew, thank you for having me. Very much appreciate it.