1 00:00:02,420 --> 00:00:04,575 So, thanks a lot, Pieter, 2 00:00:04,575 --> 00:00:06,690 for joining me today. 3 00:00:06,690 --> 00:00:08,560 I think a lot of people know you as 4 00:00:08,560 --> 00:00:12,150 a well-known machine learning and deep learning and robotics researcher. 5 00:00:12,150 --> 00:00:15,550 I'd like to have people hear a bit about your story. 6 00:00:15,550 --> 00:00:18,220 How did you end up doing the work that you do? 7 00:00:18,220 --> 00:00:22,300 That's a good question, and actually if you had asked me as a 14-year-old 8 00:00:22,300 --> 00:00:24,775 what I was aspiring to do, 9 00:00:24,775 --> 00:00:26,775 it probably would not have been this. 10 00:00:26,775 --> 00:00:28,285 In fact, at the time, 11 00:00:28,285 --> 00:00:32,565 I thought being a professional basketball player would be the right way to go. 12 00:00:32,565 --> 00:00:34,680 I don't think I was able to achieve it. 13 00:00:34,680 --> 00:00:36,430 I feel like machine learning lucked out, 14 00:00:36,430 --> 00:00:38,250 that the basketball thing didn't work out. 15 00:00:38,250 --> 00:00:39,510 Yes, that didn't work out. 16 00:00:39,510 --> 00:00:41,890 It was a lot of fun playing basketball but it didn't work 17 00:00:41,890 --> 00:00:44,885 out to try to make it into a career. 18 00:00:44,885 --> 00:00:48,530 So, what I really liked in school was physics and math. 19 00:00:48,530 --> 00:00:50,005 And so, from there, 20 00:00:50,005 --> 00:00:52,120 it seemed pretty natural to study engineering, which 21 00:00:52,120 --> 00:00:55,735 is applying physics and math in the real world. 22 00:00:55,735 --> 00:00:58,150 And actually then, after my undergrad in electrical engineering, 23 00:00:58,150 --> 00:01:00,355 I actually wasn't so sure what to do because, 24 00:01:00,355 --> 00:01:03,981 literally, anything engineering seemed interesting to me. 25 00:01:03,981 --> 00:01:07,680 Understanding how anything works seems interesting.
26 00:01:07,680 --> 00:01:09,595 Trying to build anything is interesting. 27 00:01:09,595 --> 00:01:11,470 And in some sense, 28 00:01:11,470 --> 00:01:13,690 artificial intelligence won out because it seemed like it 29 00:01:13,690 --> 00:01:18,280 could somehow help all disciplines in some way. 30 00:01:18,280 --> 00:01:22,370 And also, it seemed somehow a little more at the core of everything. 31 00:01:22,370 --> 00:01:24,575 You think about how a machine can think, 32 00:01:24,575 --> 00:01:30,160 then maybe that's more the core of everything else than picking any specific discipline. 33 00:01:30,160 --> 00:01:33,260 I've been saying AI is the new electricity; 34 00:01:33,260 --> 00:01:35,020 sounds like the 14-year-old version of you 35 00:01:35,020 --> 00:01:37,923 had an earlier version of that even. 36 00:01:37,923 --> 00:01:44,465 You know, in the past few years you've done a lot of work in deep reinforcement learning. 37 00:01:44,465 --> 00:01:49,315 What's happening? Why is deep reinforcement learning suddenly taking off? 38 00:01:49,315 --> 00:01:51,030 Before I worked in deep reinforcement learning, 39 00:01:51,030 --> 00:01:52,765 I worked a lot in reinforcement learning; 40 00:01:52,765 --> 00:01:56,115 actually with you and Durant at Stanford, of course. 41 00:01:56,115 --> 00:01:59,863 And so, we worked on autonomous helicopter flight, 42 00:01:59,863 --> 00:02:02,440 then later at Berkeley with some of my students who worked 43 00:02:02,440 --> 00:02:05,440 on getting a robot to learn to fold laundry. 44 00:02:05,440 --> 00:02:09,340 And kind of what characterized the work was a combination 45 00:02:09,340 --> 00:02:13,015 of learning that enabled things that would not be possible without learning, 46 00:02:13,015 --> 00:02:18,120 but also a lot of domain expertise in combination with the learning to get this to work.
47 00:02:18,120 --> 00:02:20,975 And it was very 48 00:02:20,975 --> 00:02:22,600 interesting because you needed domain expertise, which 49 00:02:22,600 --> 00:02:24,310 was fun to acquire but, at the same time, 50 00:02:24,310 --> 00:02:28,234 was very time-consuming; for every new application you wanted to succeed at, 51 00:02:28,234 --> 00:02:31,060 you needed domain expertise plus machine learning expertise. 52 00:02:31,060 --> 00:02:34,240 And for me it was in 2012, with 53 00:02:34,240 --> 00:02:39,910 the ImageNet breakthrough results from Geoff Hinton's group in Toronto, 54 00:02:39,910 --> 00:02:42,880 AlexNet showing that supervised learning, all of a sudden, 55 00:02:42,880 --> 00:02:48,220 could be done with far less engineering for the domain at hand. 56 00:02:48,220 --> 00:02:50,410 There was very little vision-specific engineering in AlexNet. 57 00:02:50,410 --> 00:02:53,075 It made me think we really should revisit 58 00:02:53,075 --> 00:02:57,610 reinforcement learning under the same kind of viewpoint and see if we can 59 00:02:57,610 --> 00:03:01,075 get the deep version of reinforcement learning to work and do 60 00:03:01,075 --> 00:03:05,950 equally interesting things as had just happened in supervised learning. 61 00:03:05,950 --> 00:03:08,565 It sounds like you saw earlier than 62 00:03:08,565 --> 00:03:12,250 most people the potential of deep reinforcement learning. 63 00:03:12,250 --> 00:03:14,365 So now, looking into the future, 64 00:03:14,365 --> 00:03:16,180 what do you see next? 65 00:03:16,180 --> 00:03:17,260 What are your predictions for the 66 00:03:17,260 --> 00:03:20,440 next several years to come in deep reinforcement learning? 67 00:03:20,440 --> 00:03:23,270 So, I think what's interesting about deep reinforcement learning is that, 68 00:03:23,270 --> 00:03:26,795 in some sense, there are many more questions than in supervised learning. 69 00:03:26,795 --> 00:03:29,817 In supervised learning, it's about learning an input-output mapping.
70 00:03:29,817 --> 00:03:34,505 In reinforcement learning there is the notion of: Where does the data even come from? 71 00:03:34,505 --> 00:03:36,580 So that's the exploration problem. 72 00:03:36,580 --> 00:03:38,470 When you have data, how do you do credit assignment? 73 00:03:38,470 --> 00:03:43,315 How do you understand which actions you took early on got you the reward later? 74 00:03:43,315 --> 00:03:44,830 And then, there are issues of safety. 75 00:03:44,830 --> 00:03:47,335 When you have a system autonomously collecting data, 76 00:03:47,335 --> 00:03:50,140 it's actually rather dangerous in most situations. 77 00:03:50,140 --> 00:03:51,880 Imagine a self-driving car company that says, 78 00:03:51,880 --> 00:03:53,825 we're just going to run deep reinforcement learning. 79 00:03:53,825 --> 00:03:55,690 It's pretty likely that car would get into a lot of 80 00:03:55,690 --> 00:03:57,985 accidents before it does anything useful. 81 00:03:57,985 --> 00:03:59,650 You need negative examples of that, right? 82 00:03:59,650 --> 00:04:02,000 You do need some negative examples somehow, yes; 83 00:04:02,000 --> 00:04:04,930 and positive ones, hopefully. 84 00:04:04,930 --> 00:04:07,540 So, I think there are still a lot of challenges in 85 00:04:07,540 --> 00:04:09,760 deep reinforcement learning in terms of 86 00:04:09,760 --> 00:04:12,635 working out some of the specifics of how to get these things to work. 87 00:04:12,635 --> 00:04:14,520 So, the deep part is the representation, 88 00:04:14,520 --> 00:04:18,455 but then the reinforcement learning itself still has a lot of questions. 89 00:04:18,455 --> 00:04:20,485 And what I feel is that, 90 00:04:20,485 --> 00:04:22,810 with the advances in deep learning, 91 00:04:22,810 --> 00:04:27,430 somehow one part of the puzzle in reinforcement learning has been largely addressed, 92 00:04:27,430 --> 00:04:29,075 which is the representation part.
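[Editor's note] The credit-assignment question described above is commonly made concrete with discounted returns: a reward earned late in an episode is credited back to earlier actions with exponentially decaying weight. A minimal illustrative sketch follows; the episode rewards and discount factor are made-up values, not anything stated in the interview:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every timestep t:

        G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    G_t is the learning signal credited to the action taken at t,
    which is one simple answer to the credit-assignment question.
    """
    returns = []
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Toy episode: no reward until the very last step. The final reward is
# credited, with discounting, back to the two earlier actions.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))
```

With gamma below 1, actions far from the eventual reward receive exponentially less credit, which is one way to see why reasoning over long time horizons stays hard.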
93 00:04:29,075 --> 00:04:31,540 So, if there is a pattern we can 94 00:04:31,540 --> 00:04:34,795 probably represent it with a deep network and capture that pattern. 95 00:04:34,795 --> 00:04:39,400 And how to tease apart the pattern is still a big challenge in reinforcement learning. 96 00:04:39,400 --> 00:04:41,740 So I think big challenges are 97 00:04:41,740 --> 00:04:45,695 how to get systems to reason over long time horizons. 98 00:04:45,695 --> 00:04:47,770 So right now, a lot of the successes 99 00:04:47,770 --> 00:04:50,650 in deep reinforcement learning are on a very short horizon. 100 00:04:50,650 --> 00:04:52,000 There are problems where, 101 00:04:52,000 --> 00:04:54,445 if you act well over a five-second horizon, 102 00:04:54,445 --> 00:04:57,815 you act well over the entire problem. 103 00:04:57,815 --> 00:05:02,599 And so a five-second scale is something very different from a day-long scale, 104 00:05:02,599 --> 00:05:06,930 or the ability to live a life as a robot or some software agent. 105 00:05:06,930 --> 00:05:09,240 So, I think there's a lot of challenges there. 106 00:05:09,240 --> 00:05:12,790 I think safety has a lot of challenges in terms of, 107 00:05:12,790 --> 00:05:14,920 how do you learn safely and also how do 108 00:05:14,920 --> 00:05:17,785 you keep learning once you're already pretty good? 109 00:05:17,785 --> 00:05:20,305 So, to give an example again that 110 00:05:20,305 --> 00:05:23,070 a lot of people would be familiar with, self-driving cars: 111 00:05:23,070 --> 00:05:26,375 for a self-driving car to be better than a human driver, 112 00:05:26,375 --> 00:05:31,990 well, human drivers maybe get into bad accidents only every three million miles or something. 113 00:05:31,990 --> 00:05:35,763 And so, it takes a long time to see the negative data, 114 00:05:35,763 --> 00:05:37,510 once you're as good as a human driver. 115 00:05:37,510 --> 00:05:40,835 But you want your self-driving car to be better than a human driver.
116 00:05:40,835 --> 00:05:43,930 And so, at that point the data collection becomes really, really difficult: to get 117 00:05:43,930 --> 00:05:48,175 that interesting data that makes your system improve. 118 00:05:48,175 --> 00:05:52,420 So, there are a lot of challenges related to exploration that tie into that. 119 00:05:52,420 --> 00:05:57,190 But one of the things I'm actually most excited about right now is seeing 120 00:05:57,190 --> 00:06:02,720 if we can actually take a step back and also learn the reinforcement learning algorithm. 121 00:06:02,720 --> 00:06:05,030 So, reinforcement learning is very complex, 122 00:06:05,030 --> 00:06:07,450 credit assignment is very complex, exploration is very complex. 123 00:06:07,450 --> 00:06:08,905 And so maybe, just like 124 00:06:08,905 --> 00:06:13,795 how deep learning for supervised learning was able to replace a lot of domain expertise, 125 00:06:13,795 --> 00:06:17,320 maybe we can have programs that are learned, 126 00:06:17,320 --> 00:06:20,140 that are reinforcement learning programs that do all this, 127 00:06:20,140 --> 00:06:22,510 instead of us designing the details. 128 00:06:22,510 --> 00:06:25,560 Learning the reward function or learning the whole program? 129 00:06:25,560 --> 00:06:28,150 So, this would be learning the entire reinforcement learning program. 130 00:06:28,150 --> 00:06:30,430 So, it would be, imagine, 131 00:06:30,430 --> 00:06:34,255 you have a reinforcement learning program, whatever it is, 132 00:06:34,255 --> 00:06:38,320 and you throw it at some problem and then you see how long it takes to learn. 133 00:06:38,320 --> 00:06:41,020 And then you say, well, that took a while. 134 00:06:41,020 --> 00:06:44,950 Now, let another program modify this reinforcement learning program. 135 00:06:44,950 --> 00:06:48,045 After the modification, see how fast it learns.
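[Editor's note] The loop being described here, let an outer program modify the reinforcement learning program, measure how fast the modified version learns, and keep changes that help, is essentially hill-climbing over the learning algorithm itself. A deliberately toy sketch follows, in which the whole "RL program" is reduced to a single hyperparameter and `learning_speed` is a stand-in score function; every name and number is illustrative rather than any real system:

```python
import random

def learning_speed(algo_params):
    """Stand-in for 'run the RL program on a task and time how long it
    takes to learn'. Here it is a made-up score that peaks at lr = 0.1;
    higher means the program learned faster."""
    return -abs(algo_params["lr"] - 0.1)

def meta_optimize(steps=200, seed=0):
    """Hill-climb over the RL program itself: propose a random
    modification and keep it whenever the modified program learns
    faster than the current one."""
    rng = random.Random(seed)
    params = {"lr": 1.0}              # the initial, badly-tuned RL program
    best = learning_speed(params)
    for _ in range(steps):
        candidate = {"lr": params["lr"] + rng.gauss(0.0, 0.05)}
        score = learning_speed(candidate)
        if score > best:              # learned more quickly: keep it
            params, best = candidate, score
    return params

print(meta_optimize())
```

In real "learning to learn" work the single hyperparameter would be the parameters of a whole learned update rule and the score would come from full reinforcement learning runs, which is exactly the compute-hungry inner-loop setup the conversation turns to next.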
136 00:06:48,045 --> 00:06:49,641 If it learns more quickly, 137 00:06:49,641 --> 00:06:54,380 that was a good modification and maybe keep it and improve from there. 138 00:06:54,380 --> 00:06:57,630 Well, I see, right. Yes, and that's the direction. 139 00:06:57,630 --> 00:06:59,290 I think it has a lot to do with, maybe, 140 00:06:59,290 --> 00:07:01,510 the amount of compute that's becoming available. 141 00:07:01,510 --> 00:07:05,860 So, this would be running reinforcement learning in the inner loop. 142 00:07:05,860 --> 00:07:08,975 For us right now, we run reinforcement learning as the final thing. 143 00:07:08,975 --> 00:07:11,260 And so, the more compute we get, 144 00:07:11,260 --> 00:07:14,545 the more it becomes possible to maybe run something 145 00:07:14,545 --> 00:07:19,160 like reinforcement learning in the inner loop of a bigger algorithm. 146 00:07:19,160 --> 00:07:22,080 Starting from the 14-year-old, 147 00:07:22,080 --> 00:07:25,355 you've worked in AI for some 20-plus years now. 148 00:07:25,355 --> 00:07:32,795 So, tell me a bit about how your understanding of AI has evolved over this time. 149 00:07:32,795 --> 00:07:35,280 When I started looking at AI, 150 00:07:35,280 --> 00:07:38,230 it's very interesting because it really 151 00:07:38,230 --> 00:07:41,445 coincided with coming to Stanford to do my master's degree there, 152 00:07:41,445 --> 00:07:46,998 and there were some icons there like John McCarthy, whom I got to talk with, 153 00:07:46,998 --> 00:07:49,300 but who had a very different approach, 154 00:07:49,300 --> 00:07:50,460 in the year 2000, 155 00:07:50,460 --> 00:07:52,115 from what most people were doing at the time. 156 00:07:52,115 --> 00:07:54,958 And also talking with Daphne Koller. 157 00:07:54,958 --> 00:07:59,320 And I think a lot of my initial thinking of AI was shaped by Daphne's thinking.
158 00:07:59,320 --> 00:08:04,300 Her AI class, her probabilistic graphical models class, 159 00:08:04,300 --> 00:08:06,820 and kind of really being intrigued by 160 00:08:06,820 --> 00:08:11,450 how simply a distribution over many random variables, and then being able to condition 161 00:08:11,450 --> 00:08:14,950 on some subset of variables and draw conclusions about others, could 162 00:08:14,950 --> 00:08:19,015 actually give you so much if you can somehow make it computationally tractable, 163 00:08:19,015 --> 00:08:23,170 which was definitely the challenge. 164 00:08:23,170 --> 00:08:25,090 And then from there, 165 00:08:25,090 --> 00:08:28,335 when I started my Ph.D. and you arrived at Stanford, 166 00:08:28,335 --> 00:08:30,910 I think you gave me a really good reality check, 167 00:08:30,910 --> 00:08:35,350 that that's not the right metric to evaluate your work by, 168 00:08:35,350 --> 00:08:38,470 and to really try to see the connection from what 169 00:08:38,470 --> 00:08:41,710 you're working on to what impact it can really have, 170 00:08:41,710 --> 00:08:46,660 what change it can make, rather than what's the math that happened to be in your work. 171 00:08:46,660 --> 00:08:48,425 Right. That's amazing. 172 00:08:48,425 --> 00:08:50,685 I did not realize, I'd forgotten that. 173 00:08:50,685 --> 00:08:54,267 Yes, it's actually the thing I cite most often when people ask: 174 00:08:54,267 --> 00:09:01,090 if you were to cite only one thing that has stuck with you from Andrew's advice, 175 00:09:01,090 --> 00:09:05,995 it's making sure you can see the connection to where it's actually going to do something. 176 00:09:05,995 --> 00:09:11,332 You've had, and you're continuing to have, an amazing career in AI.
177 00:09:11,332 --> 00:09:14,750 So, for some of the people listening to you on video now, 178 00:09:14,750 --> 00:09:18,815 if they want to also enter or pursue a career in AI, 179 00:09:18,815 --> 00:09:20,985 what advice do you have for them? 180 00:09:20,985 --> 00:09:25,185 I think it's a really good time to get into artificial intelligence. 181 00:09:25,185 --> 00:09:28,965 If you look at the demand for people, it's so high, 182 00:09:28,965 --> 00:09:30,741 there are so many job opportunities, 183 00:09:30,741 --> 00:09:32,365 so many things you can do, research-wise, 184 00:09:32,365 --> 00:09:34,735 building new companies and so forth. 185 00:09:34,735 --> 00:09:39,240 So, I'd say yes, it's definitely a smart decision in terms of actually getting going. 186 00:09:39,240 --> 00:09:41,140 A lot of it, you can self-study, 187 00:09:41,140 --> 00:09:42,635 whether you're in school or not. 188 00:09:42,635 --> 00:09:44,150 There are a lot of online courses; for instance, 189 00:09:44,150 --> 00:09:45,585 your machine learning course. 190 00:09:45,585 --> 00:09:48,400 There is also, for example, 191 00:09:48,400 --> 00:09:52,030 Andrej Karpathy's deep learning course, which has videos online, 192 00:09:52,030 --> 00:09:54,280 which is a great way to get started. 193 00:09:54,280 --> 00:09:57,460 Berkeley has a deep reinforcement learning course 194 00:09:57,460 --> 00:09:59,260 which has all of the lectures online. 195 00:09:59,260 --> 00:10:01,235 So, those are all good places to get started. 196 00:10:01,235 --> 00:10:06,470 I think a big part of what's important is to make sure you try things yourself. 197 00:10:06,470 --> 00:10:10,055 So, not just read things or watch videos, but try things out.
198 00:10:10,055 --> 00:10:14,347 With frameworks like TensorFlow, 199 00:10:14,347 --> 00:10:16,040 Chainer, Theano, PyTorch and so forth, 200 00:10:16,040 --> 00:10:17,350 I mean, whatever is your favorite, 201 00:10:17,350 --> 00:10:21,980 it's very easy to get going and get something up and running very quickly. 202 00:10:21,980 --> 00:10:24,669 To get the practice yourself, right? 203 00:10:24,669 --> 00:10:27,105 With implementing and seeing what does and what doesn't work. 204 00:10:27,105 --> 00:10:29,360 So, this past week there was an article in 205 00:10:29,360 --> 00:10:31,715 Mashable about a 16-year-old in the United Kingdom 206 00:10:31,715 --> 00:10:34,580 who is one of the leaders in Kaggle competitions. 207 00:10:34,580 --> 00:10:36,690 And it just said 208 00:10:36,690 --> 00:10:39,290 he just went out and learned things, 209 00:10:39,290 --> 00:10:41,510 found things online, learned everything himself and 210 00:10:41,510 --> 00:10:44,915 never actually took any formal course per se. 211 00:10:44,915 --> 00:10:49,180 And here is a 16-year-old being very competitive in Kaggle competitions, 212 00:10:49,180 --> 00:10:50,990 so it's definitely possible. 213 00:10:50,990 --> 00:10:53,120 We live in good times. 214 00:10:53,120 --> 00:10:54,560 If people want to learn. 215 00:10:54,560 --> 00:10:55,940 Absolutely. 216 00:10:55,940 --> 00:10:57,980 One question I bet you get asked sometimes 217 00:10:57,980 --> 00:11:00,160 is, if someone wants to enter AI, machine learning, and deep learning, 218 00:11:00,160 --> 00:11:06,885 should they apply to a Ph.D. program or should they get a job with a big company? 219 00:11:06,885 --> 00:11:12,395 I think a lot of it has to do with maybe how much mentoring you can get. 220 00:11:12,395 --> 00:11:14,780 So, in a Ph.D.
program, 221 00:11:14,780 --> 00:11:16,400 you're essentially guaranteed: 222 00:11:16,400 --> 00:11:17,787 the job of the professor, 223 00:11:17,787 --> 00:11:18,830 who is your adviser, 224 00:11:18,830 --> 00:11:20,800 is to look out for you. 225 00:11:20,800 --> 00:11:21,950 They'll try to do everything they can to, 226 00:11:21,950 --> 00:11:23,565 kind of, shape you, 227 00:11:23,565 --> 00:11:28,720 help you become stronger at whatever you want to do, for example, AI. 228 00:11:28,720 --> 00:11:32,060 And so, there is a very clear dedicated person; sometimes you have two advisers. 229 00:11:32,060 --> 00:11:34,955 And that's literally their job, and that's why they are professors; 230 00:11:34,955 --> 00:11:37,755 most of what they like about being professors often is helping 231 00:11:37,755 --> 00:11:41,200 shape students to become more capable at things. 232 00:11:41,200 --> 00:11:43,250 Now, it doesn't mean it's not possible at companies, 233 00:11:43,250 --> 00:11:46,730 and many companies have really good mentors and have people who love 234 00:11:46,730 --> 00:11:51,110 to help educate people who come in and strengthen them, and so forth. 235 00:11:51,110 --> 00:11:55,515 It's just, it might not be as much of a guarantee and a given, 236 00:11:55,515 --> 00:12:00,540 compared to actually enrolling in a Ph.D. program, where the crux of 237 00:12:00,540 --> 00:12:06,020 the program is that you're going to learn and somebody is there to help you learn. 238 00:12:06,020 --> 00:12:09,675 So it really depends on the company and depends on the Ph.D. program. 239 00:12:09,675 --> 00:12:14,130 Absolutely, yes. But I think it is key that you can learn a lot on your own. 240 00:12:14,130 --> 00:12:17,910 But I think you can learn a lot faster if you have somebody who's more experienced, 241 00:12:17,910 --> 00:12:20,469 who is actually taking it up as 242 00:12:20,469 --> 00:12:24,945 their responsibility to spend time with you and help accelerate your progress.
243 00:12:24,945 --> 00:12:28,780 So, you've been one of the most visible leaders in deep reinforcement learning. 244 00:12:28,780 --> 00:12:30,720 So, what are the things that 245 00:12:30,720 --> 00:12:32,930 deep reinforcement learning is already working really well at? 246 00:12:32,930 --> 00:12:37,450 I think, if you look at some deep reinforcement learning successes, 247 00:12:37,450 --> 00:12:39,000 it's very, very intriguing. 248 00:12:39,000 --> 00:12:42,810 For example, learning to play Atari games from pixels, 249 00:12:42,810 --> 00:12:45,540 processing these pixels, which are just numbers that are being 250 00:12:45,540 --> 00:12:49,150 processed somehow and turned into joystick actions. 251 00:12:49,150 --> 00:12:52,605 Then, for example, some of the work we did at Berkeley was, 252 00:12:52,605 --> 00:12:57,105 we have a simulated robot inventing walking, and the reward 253 00:12:57,105 --> 00:12:59,340 that it's given is as simple as: the further you go north the 254 00:12:59,340 --> 00:13:02,170 better, and the less hard you impact with the ground the better. 255 00:13:02,170 --> 00:13:06,949 And somehow it decides that walking slash running is the thing to invent, whereas 256 00:13:06,949 --> 00:13:10,095 nobody showed it what walking is or running is. 257 00:13:10,095 --> 00:13:14,220 Or a robot playing with children's toys and learning to kind of put them together, 258 00:13:14,220 --> 00:13:16,935 put a block into a matching opening, and so forth. 259 00:13:16,935 --> 00:13:20,280 And so, I think it's really interesting that in all of these it's possible to learn 260 00:13:20,280 --> 00:13:24,510 from raw sensory inputs all the way to raw controls, 261 00:13:24,510 --> 00:13:27,990 for example, torques at the motors. 262 00:13:27,990 --> 00:13:29,225 But at the same time, 263 00:13:29,225 --> 00:13:32,460 it is very interesting that you can have a single algorithm.
264 00:13:32,460 --> 00:13:35,310 For example, you know, trust region policy optimization: you can 265 00:13:35,310 --> 00:13:36,745 have a robot learn to run, 266 00:13:36,745 --> 00:13:38,135 can have a robot learn to stand up, 267 00:13:38,135 --> 00:13:40,395 can have, instead of a two-legged robot, 268 00:13:40,395 --> 00:13:42,445 now you swap in a four-legged robot. 269 00:13:42,445 --> 00:13:46,465 You run the same reinforcement learning algorithm and it still learns to run. 270 00:13:46,465 --> 00:13:49,280 And so, there is no change in the reinforcement learning algorithm. 271 00:13:49,280 --> 00:13:51,615 It's very, very general. Same for the Atari games. 272 00:13:51,615 --> 00:13:54,565 DQN was the same DQN for every one of the games. 273 00:13:54,565 --> 00:13:56,640 But then, when it actually starts hitting 274 00:13:56,640 --> 00:14:00,060 the frontiers of what's not yet possible as well, 275 00:14:00,060 --> 00:14:03,490 it's nice that it learns from scratch for each one of 276 00:14:03,490 --> 00:14:07,405 these tasks, but it would be even nicer if it could reuse things it's learned in the past 277 00:14:07,405 --> 00:14:09,640 to learn even more quickly for the next task. 278 00:14:09,640 --> 00:14:13,100 And that's something that's still on the frontier and not yet possible. 279 00:14:13,100 --> 00:14:16,490 It always starts from scratch, essentially. 280 00:14:16,490 --> 00:14:19,390 How quickly do you think we'll see deep 281 00:14:19,390 --> 00:14:22,420 reinforcement learning deployed in the robots around us, 282 00:14:22,420 --> 00:14:25,935 the robots that are getting deployed in the world today? 283 00:14:25,935 --> 00:14:29,380 I think in practice the realistic scenario is one 284 00:14:29,380 --> 00:14:32,770 where it starts with supervised learning, 285 00:14:32,770 --> 00:14:35,960 behavioral cloning; humans do the work.
286 00:14:35,960 --> 00:14:38,530 And I think a lot of businesses will be built 287 00:14:38,530 --> 00:14:41,790 that way, where it's a human behind the scenes doing a lot of the work. 288 00:14:41,790 --> 00:14:44,980 Imagine a Facebook Messenger assistant. 289 00:14:44,980 --> 00:14:47,980 An assistant like that could be built with a human behind 290 00:14:47,980 --> 00:14:51,310 the curtains doing a lot of the work; machine learning 291 00:14:51,310 --> 00:14:54,380 matches up with what the human does and starts making suggestions to the 292 00:14:54,380 --> 00:14:58,130 human, so the human has a small number of options they can just click and select. 293 00:14:58,130 --> 00:14:59,640 And then over time, 294 00:14:59,640 --> 00:15:01,130 as it gets pretty good, 295 00:15:01,130 --> 00:15:04,465 you start infusing some reinforcement learning, where you give it actual objectives, 296 00:15:04,465 --> 00:15:06,565 not just matching the human behind the curtains 297 00:15:06,565 --> 00:15:09,040 but giving objectives of achievement like, 298 00:15:09,040 --> 00:15:14,110 maybe, how fast were these two people able to plan their meeting? 299 00:15:14,110 --> 00:15:16,385 Or how fast were they able to book their flight? 300 00:15:16,385 --> 00:15:18,340 Or things like that. How long did it take? 301 00:15:18,340 --> 00:15:20,065 How happy were they with it? 302 00:15:20,065 --> 00:15:22,815 But it would probably have to be bootstrapped off a lot of 303 00:15:22,815 --> 00:15:27,605 behavioral cloning of humans showing how this could be done. 304 00:15:27,605 --> 00:15:30,690 So it sounds like behavioral cloning is just supervised learning to 305 00:15:30,690 --> 00:15:33,580 mimic whatever the person is doing, and then gradually, later on, 306 00:15:33,580 --> 00:15:37,434 reinforcement learning to have it think about longer time horizons? 307 00:15:37,434 --> 00:15:38,500 Is that a fair summary? 308 00:15:38,500 --> 00:15:39,715 I'd say so, yes.
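[Editor's note] The recipe just summarized, imitate first with supervised learning and then fine-tune against real objectives, can be sketched end to end. This is a schematic under heavy assumptions: the demo log, the states and actions, and the reward function below are all hypothetical placeholders, and the "RL" stage is reduced to a deterministic policy-improvement sweep rather than a real algorithm:

```python
# --- Stage 1: behavioral cloning (supervised learning on human demos) ---
# Hypothetical log of (state, action) pairs from the human behind the curtain.
demos = [("s0", "a1"), ("s0", "a1"), ("s0", "a0"), ("s1", "a0")]

ACTIONS = ["a0", "a1"]

def clone_policy(demos):
    """Simplest possible imitation: for each state, pick the action the
    human chose most often."""
    counts = {}
    for s, a in demos:
        counts.setdefault(s, {}).setdefault(a, 0)
        counts[s][a] += 1
    return {s: max(acts, key=acts.get) for s, acts in counts.items()}

# --- Stage 2: fine-tune against an actual objective ---
# Hypothetical reward, e.g. "how fast did the meeting get planned".
def reward(s, a):
    return 1.0 if (s, a) == ("s0", "a0") else 0.0

def fine_tune(policy):
    """Deterministic stand-in for the RL stage: in each state, switch to
    any action that scores better than the imitated one, so the objective
    overrides pure imitation where the two disagree."""
    for s in policy:
        for a in ACTIONS:
            if reward(s, a) > reward(s, policy[s]):
                policy[s] = a
    return policy

policy = clone_policy(demos)   # mimics the human: {'s0': 'a1', 's1': 'a0'}
policy = fine_tune(policy)     # the reward flips s0 from 'a1' to 'a0'
print(policy)
```

In practice stage 1 would be a neural network trained by supervised learning and stage 2 a policy-gradient-style update, but the shape of the pipeline, imitate, then optimize the true objective, is the same.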
309 00:15:39,715 --> 00:15:43,540 Just because straight-up reinforcement learning from scratch is really fun to watch. 310 00:15:43,540 --> 00:15:46,780 It's super intriguing; there are very few things more fun to watch 311 00:15:46,780 --> 00:15:50,440 than a reinforcement learning robot starting from nothing and inventing things. 312 00:15:50,440 --> 00:15:54,280 But it's just time-consuming and it's not always safe. 313 00:15:54,280 --> 00:15:56,200 Thank you very much. That was fascinating. 314 00:15:56,200 --> 00:15:58,005 I'm really glad we had the chance to chat. 315 00:15:58,005 --> 00:16:02,670 Well, Andrew, thank you for having me. Very much appreciate it.