In this video, I'm gonna talk about three different types of machine learning: supervised learning, reinforcement learning and unsupervised learning. Broadly speaking, the first half of the course will be about supervised learning. The second half of the course will be mainly about unsupervised learning, and reinforcement learning will not be covered in the course, because we can't cover everything.

Learning can be divided into three broad groups of algorithms. In supervised learning, you're trying to predict an output when given an input vector, so it's very clear what the point of supervised learning is. In reinforcement learning, you're trying to select actions or sequences of actions to maximize the rewards you get, and the rewards may only occur occasionally. In unsupervised learning, you're trying to discover a good internal representation of the input, and we'll come later to what that might mean.

Supervised learning itself comes in two different flavors. In regression, the target output is a real number or a whole vector of real numbers, such as the price of a stock in six months' time, or the temperature at noon tomorrow. The aim is to get as close as you can to the correct real number. In classification, the target output is a class label. The simplest case is a choice between one and zero, between positive and negative cases. But obviously we can have multiple alternative labels, as when we're classifying handwritten digits.

Supervised learning works by initially selecting a model class, that is, a whole set of models that we're prepared to consider as candidates. You can think of a model class as a function that takes an input vector and some parameters and gives you an output y. So a model class is simply a way of mapping an input to an output using some numerical parameters W, and then we adjust these numerical parameters to make the mapping fit the supervised training data. What we mean by fit is minimizing the discrepancy between the target output on each training case and the actual output produced by the machine learning system. An obvious measure of that discrepancy, if we're using real values as outputs, is the squared difference between the output from our system, y, and the correct output, t, and we put in a factor of one-half so that it cancels the two when we differentiate.
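To make that concrete, here is a minimal sketch of the fitting procedure just described, assuming a linear model class y = W x (the lecture leaves the model class unspecified, so the linear form, the learning rate, and the function names are my own illustration). The discrepancy measure is E = (1/2) * sum((y - t)^2), and the one-half cancels the two when we differentiate, leaving y - t:

    import numpy as np

    # A model class: a mapping from input x to output y via parameters W.
    # The linear form y = W @ x is an assumption made for illustration.
    def predict(W, x):
        return W @ x

    # Discrepancy on one training case: E = 1/2 * sum((y - t)^2).
    def error(W, x, t):
        y = predict(W, x)
        return 0.5 * np.sum((y - t) ** 2)

    # The one-half cancels the two on differentiation: dE/dW = (y - t) x^T.
    def gradient(W, x, t):
        y = predict(W, x)
        return np.outer(y - t, x)

    # Adjust the parameters W to make the mapping fit the training data.
    def fit(W, cases, lr=0.01, steps=100):
        for _ in range(steps):
            for x, t in cases:
                W = W - lr * gradient(W, x, t)
        return W

Calling fit with a list of (x, t) training cases is one way of doing the adjustment described above: it repeatedly nudges W to reduce the discrepancy between target and actual outputs.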
For classification you could use that measure, but there are other, more sensible measures which we'll come to later, and these more sensible measures typically work better as well.

In reinforcement learning, the output is an actual sequence of actions, and you have to decide on those actions based on occasional rewards. The goal in selecting each action is to maximize the expected sum of the future rewards, and we typically use a discount factor so that you don't have to look too far into the future. We say that rewards far in the future don't count for as much as rewards that you get fairly quickly.
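A minimal sketch of that discounted sum (my own illustration; the function name and the value of the discount factor gamma are assumptions, not from the lecture):

    def discounted_return(rewards, gamma=0.9):
        # Expected sum of future rewards, where a reward k steps away is
        # scaled by gamma**k, so later rewards count for less:
        # R = r0 + gamma * r1 + gamma**2 * r2 + ...
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

With gamma = 0.9, a reward arriving 20 steps away is scaled by 0.9**20, which is about 0.12, so the agent doesn't have to look very far into the future.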
Reinforcement learning is difficult. It's difficult because the rewards are typically delayed, so it's hard to know exactly which action was the wrong one in a long sequence of actions. It's also difficult because a scalar reward, especially one that only occurs occasionally, does not supply much information on which to base the changes in parameters. So typically you can't learn millions of parameters using reinforcement learning, whereas with supervised learning and unsupervised learning, you can. Typically, in reinforcement learning, you're trying to learn dozens of parameters, or maybe 1,000 parameters, but not millions. In this course, we can't cover everything, and so we're not going to cover reinforcement learning, even though it's an important topic.

Unsupervised learning is going to be covered in the second half of the course. For about 40 years, the machine learning community basically ignored unsupervised learning, except for one very limited form called clustering. In fact, they used definitions of machine learning that excluded it. Some textbooks defined machine learning as mapping from inputs to outputs, and many researchers thought that clustering was the only form of unsupervised learning. One reason for this is that it's hard to say what the aim of unsupervised learning is.

One major aim is to get an internal representation of the input that is useful for subsequent supervised or reinforcement learning. The reason we might want to do that in two stages is that we don't want to use, for example, the payoffs from reinforcement learning in order to set the parameters for our visual system. So you can compute the distance to a surface by using the disparity between the images you get in your two eyes. But you don't want to learn to do that computation of distance by repeatedly stubbing your toe and adjusting the parameters in your visual system every time you stub your toe. That would involve stubbing your toe a very large number of times, and there are much better ways to learn to fuse two images based purely on the information in the inputs.

Other goals for unsupervised learning are to provide compact, low-dimensional representations of the input. High-dimensional inputs like images typically live on or near a low-dimensional manifold, or several such manifolds in the case of the handwritten digits. What that means is, even if you have a million pixels, there aren't really a million degrees of freedom in what can happen. There may only be a few hundred degrees of freedom in what can happen. So what we want to do is move from a million pixels to a representation of those few hundred degrees of freedom, which amounts to saying where we are on a manifold. We also need to know which manifold we're on. A very limited form of this is principal components analysis, which is linear: it assumes that there's one manifold, and the manifold is a plane in the high-dimensional space.

Another goal for unsupervised learning is to provide an economical representation of the input in terms of learned features. If, for example, we can represent the input in terms of binary features, that's typically economical, because then it takes only one bit to say the state of a binary feature. Alternatively, we could use a large number of real-valued features but insist that for each input almost all of those features are exactly zero. In that case, for each input, we only need to represent a few real numbers, and that's economical.

As I mentioned before, another goal of unsupervised learning is to find clusters in the input. Clustering can be viewed as a very sparse code: we have one feature per cluster, and we insist that all the features except one are zero and that one feature has a value of one. So clustering is really just an extreme case of finding sparse features.
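To make that last point concrete, here is a minimal sketch of a cluster assignment written as a sparse code (my own illustration; the nearest-center assignment rule, the function name, and the example centers are assumptions, not from the lecture):

    import numpy as np

    def one_hot_cluster_code(x, centers):
        # One feature per cluster: every feature is zero except the one
        # for the nearest cluster center, which is set to one.
        distances = np.linalg.norm(centers - x, axis=1)
        code = np.zeros(len(centers))
        code[np.argmin(distances)] = 1.0
        return code

    # For example, with three cluster centers in two dimensions:
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    print(one_hot_cluster_code(np.array([4.8, 5.1]), centers))  # [0. 1. 0.]

The code is sparse in exactly the sense described above: representing an input takes a single active feature out of as many features as there are clusters.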