In this video, I'm going to talk about the history of backpropagation. I'll start with where it came from in the '70s and '80s, and then I'll talk a bit about why it failed in the '90s; that is, why serious machine learning researchers abandoned it. There was a popular view of why this happened, and we can now see that that popular view was largely wrong. The real reasons it was abandoned were that computers were too slow and datasets were too small. I'll conclude by showing you a historical document: a bet made between two machine learning researchers in 1995. It's interesting to see what people back then believed and how wrong they were.

Backpropagation was invented independently several times in the '70s and '80s. It started in the late '60s with control theorists called Bryson and Ho, who invented a linear version of backpropagation. Paul Werbos went to their lectures and realized it could be made non-linear, and in his thesis in 1974 he published what's probably the first proper version of backpropagation. Rumelhart, Williams, and I invented it in 1981 without knowing about Paul Werbos's work. But we tried it out, and it didn't work very well for the first thing we tried it on, and so we abandoned it. David Parker invented it in 1985, and so did Yann LeCun. Also in 1985, I went back and tried again the thing that Rumelhart, Williams, and I had abandoned, and discovered it worked pretty well. In 1986, we produced a paper with a really convincing example of what it could do. It was clear that backpropagation had a lot of promise for learning multiple layers of non-linear feature detectors.

But it didn't really live up to its promise, and by the late 1990s most of the serious researchers in machine learning had given up on backpropagation. For example, in David MacKay's textbook there's very little mention of it. It was still widely used by psychologists for making psychological models, and it was also quite widely used in practical applications, such as credit card fraud detection. But in machine learning, people thought it had been supplanted by support vector machines.

The popular explanation of what happened to backpropagation in the late '90s was that it couldn't make use of multiple layers of non-linear features. This wasn't true of convolutional nets, which were the exception.
But in general, people couldn't get feedforward neural networks trained with backpropagation to do impressive things if they had multiple hidden layers, except for some toy examples. It also did not work well in recurrent networks or in deep auto-encoders, which we'll cover in a later lecture. Recurrent networks were perhaps the place where it was most exciting, and so it was there that it was most disappointing that people couldn't make it work well. Support vector machines, by contrast, worked well: they didn't require as much expertise to make them work, they produced repeatable results, and they had a much better, much fancier theory. So that was the popular explanation of what went wrong with backpropagation.

With more historical perspective, we can see why it really failed. Computers were thousands of times too slow, and the labeled datasets were hundreds of times too small, for the regime in which backpropagation would really shine. Also, the deep networks, as well as being too small, were not sensibly initialized. Backpropagating through deep networks didn't work well because the gradients tended to die, because the initial weights were typically too small. These issues prevented backpropagation from being successful for tasks like vision and speech, where it would eventually be a big win.
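To make the point about dying gradients concrete, here is a minimal sketch in numpy; the depth, layer width, weight scale, and the use of logistic units are my own illustrative choices, not numbers from the lecture. It pushes an error signal backwards through a stack of logistic layers whose weights were initialized too small, and the gradient norm it prints shrinks roughly geometrically with depth.

```python
import numpy as np

# Minimal sketch (illustrative choices, not from the lecture): backpropagate
# an error signal through a deep stack of logistic (sigmoid) layers whose
# weights are initialized with small random values, and watch the gradient
# norm shrink layer by layer.

np.random.seed(0)
n_layers, width = 20, 100
scale = 0.1  # "typically too small" initial weights, as in the '80s and '90s

# Forward pass: store activations so we can compute sigmoid derivatives later.
x = np.random.randn(width)
weights, activations = [], [x]
for _ in range(n_layers):
    W = scale * np.random.randn(width, width)
    x = 1.0 / (1.0 + np.exp(-(W @ x)))  # logistic non-linearity
    weights.append(W)
    activations.append(x)

# Backward pass: propagate a unit-norm error signal back towards the input.
grad = np.random.randn(width)
grad /= np.linalg.norm(grad)
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1.0 - a))  # chain rule through sigmoid and W
    print(f"gradient norm after this layer: {np.linalg.norm(grad):.2e}")
```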
So we need to distinguish between different kinds of machine learning task. There are ones that are more typical of the kinds of things people study in statistics, and ones that are more typical of the kinds of things people study in artificial intelligence. At the statistics end of the spectrum, you typically have low-dimensional data; a statistician thinks of 100 dimensions as high-dimensional data. At the artificial intelligence end of the spectrum, things like images or coefficients representing speech typically have many more than 100 dimensions. At the statistics end of the spectrum, there's usually a lot of noise in the data, whereas at the AI end of the spectrum, noise isn't the real problem. For statistics, there's often not that much structure in the data, and what structure there is can be captured by a fairly simple model. At the AI end of the spectrum, there's typically a huge amount of structure in the data. If you take a set of images, it's highly structured data, but the structure is too complicated to be captured by a simple model.

So in statistics, the main problem is separating true structure from noise, not mistaking noise for structure. This can be done pretty well by a Bayesian neural net, but for typical non-Bayesian neural nets, it's not the kind of problem they're good at. So for problems like that, it makes sense to try a support vector machine, or a method called a Gaussian process if you're doing regression, which I'll talk about briefly later. At the artificial intelligence end of the spectrum, the main problem is to find a way of representing all this complicated structure so that it can be learned. The obvious thing to do is to try to hand-design appropriate representations. But actually, it's easier to let backpropagation figure out what representations to use, by giving it multiple layers and using a lot of computational power to let it decide what the representations should be.

I now want to talk very briefly about support vector machines. I'm not going to explain how they work, but I am going to say what I think their limitations are. There are several ways in which you can view a support vector machine, and I'm going to give you two different views of them.

According to the first view, support vector machines are just a reincarnation of perceptrons with a clever trick called the kernel trick. The idea is that you take the raw input and expand it into a very large layer of non-linear but non-adaptive features. So that's just like a perceptron, where you have this big layer of features that it doesn't learn. Then you only have to learn one layer of adaptive weights: the weights from the features to the decision unit. Support vector machines have a very clever way of avoiding overfitting when they learn those weights. They look for what's called a maximum-margin hyperplane in a high-dimensional space, and they can do that much more efficiently than you might have thought possible. That's why they work well.

The second view also sees support vector machines as a clever reincarnation of perceptrons, but it has a completely different notion of what kind of features they're using. According to the second view, each input vector in the training set is used to define one feature. I'll spell it differently to indicate it's a completely different kind of feature from the first kind. Each of these features gives a scalar value which involves doing a global match between a test input and that particular training input. So, roughly speaking, it's how similar the test input is to a particular training case. Then there's a clever way of simultaneously finding how to weight those features so as to make the right decision, and also doing feature selection, that is, deciding which of those features not to use. Although these views sound extremely different from one another, they're just two alternative ways of looking at the same thing: a support vector machine. In both cases, it's using non-adaptive features and then one layer of adaptive weights, and that limits what you can do. You can't learn multiple layers of representation with a support vector machine.
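To make those two views a bit more concrete, here is a small sketch; it assumes scikit-learn, an RBF kernel, and a toy dataset, all of which are my own choices rather than anything from the lecture. It reconstructs a trained SVM's decision function by hand as one layer of learned weights applied to non-adaptive features, where each feature is the kernel similarity between a test input and one selected training case (a support vector). That is the second view; by the kernel trick, the same decision function is a maximum-margin perceptron in the kernel's implicit feature space, which is the first view.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Sketch under assumed choices (scikit-learn, RBF kernel, toy ring data),
# not code from the lecture.  The trained SVM's decision function is just
# one layer of learned weights on non-adaptive similarity features, one
# feature per selected training case (support vector).

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)   # 1 inside a ring, 0 outside

svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Rebuild the decision function by hand from the support vectors:
# f(x) = sum_i w_i * K(x, sv_i) + b, where the w_i are the learned weights.
X_test = rng.randn(5, 2)
K = rbf_kernel(X_test, svm.support_vectors_, gamma=1.0)  # non-adaptive features
f_manual = K @ svm.dual_coef_.ravel() + svm.intercept_   # one layer of weights

print(np.allclose(f_manual, svm.decision_function(X_test)))  # True
print(f"{len(svm.support_vectors_)} of {len(X)} training cases kept as features")
```

The count printed at the end shows the feature selection at work: only the training cases kept as support vectors end up being used as features.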
This is a historical document from 1995. It was given to me by [unknown], and it's a bet between Larry Jackel, who headed the adaptive systems research group at Bell Labs, and Vladimir Vapnik, who was the leading proponent of support vector machines. Larry Jackel bet that by 2000, people would understand why big neural nets trained with backpropagation worked well on large datasets; that is, they would understand it theoretically in terms of conditions and bounds. Vapnik bet that they wouldn't, but he made a side bet that if he was the one to figure it out, he would win anyway. Vapnik, in turn, bet that by 2005, nobody would be using big neural nets like that, trained with backpropagation. It turns out that they were both wrong. The limitation to using big neural nets with backpropagation was not that we didn't have a good theory, and not that they were essentially hopeless, but that we didn't have big enough computers or big enough datasets. It was a practical limitation, not a theoretical one.