One of the challenges with building machine learning systems is that there are so many things you could try, so many things you could change, including, for example, so many hyperparameters you could tune. One of the things I've noticed about the most effective machine learning people is that they're very clear-eyed about what to tune in order to achieve one particular effect. This is a process we call orthogonalization. Let me tell you what I mean.

Here's a picture of an old-school television, with a lot of knobs that you could tune to adjust the picture in various ways. For these old TV sets, maybe there was one knob to adjust how tall the image is vertically, and another knob to adjust how wide it is. Maybe another knob to adjust how trapezoidal it is, another knob to adjust how much to move the picture left and right, another one to adjust how much the picture is rotated, and so on.

What TV designers spent a lot of time doing was building the circuitry, often analog circuitry back then, to make sure each of the knobs had a relatively interpretable function: one knob to tune this, one knob to tune that, and so on.

In contrast, imagine if you had a knob that tunes 0.1 × how tall the image is, plus 0.3 × how wide the image is, minus 1.7 × how trapezoidal the image is, plus 0.8 × the position of the image on the horizontal axis, and so on. If you tune this knob, then the height of the image, the width of the image, how trapezoidal it is, and how much it shifts all change at the same time. If you had a knob like that, it would be almost impossible to tune the TV so that the picture is centered in the display area. So in this context, orthogonalization refers to the fact that the TV designers built the knobs so that each knob does only one thing. That makes it much easier to tune the TV so that the picture is centered where you want it to be.

Here's another example of orthogonalization. If you think about learning to drive a car, a car has three main controls: steering, where the steering wheel decides how much you go left or right; acceleration; and braking. So there are these three controls, or really one control for steering and another two controls for your speed.
That makes it relatively interpretable what your different actions through different controls will do to your car. But now imagine if someone built a car with a joystick, where one axis of the joystick controls 0.3 × your steering angle minus 0.8 × your speed, and you had a different control that controls 2 × the steering angle plus 0.9 × the speed of your car. In theory, by tuning these two knobs you could get your car to steer at the angle and at the speed you want. But it's much harder than if you had one single control for the steering angle and a separate, distinct set of controls for the speed.

So the concept of orthogonalization refers to this: if you think of one dimension of what you want to do as controlling the steering angle, and another dimension as controlling your speed, then you want one knob to affect only the steering angle as much as possible, and another knob, which in the case of the car is really acceleration and braking, to control your speed. But if you had a control that mixes the two together, like a control that affects both your steering angle and your speed, something that changes both at the same time, then it becomes much harder to set the car to the speed and angle you want. By having orthogonal controls (orthogonal means at 90 degrees to each other) that are ideally aligned with the things you actually want to control, it becomes much easier to tune the knobs you have to tune: the steering wheel angle, your accelerator, and your brakes, to get the car to do what you want. A small numerical sketch of this mixed-control problem follows below.

So how does this relate to machine learning? For a supervised learning system to do well, you usually need to tune the knobs of your system to make sure that four things hold true. First, you usually have to make sure that you're at least doing well on the training set. So performance on the training set needs to pass some acceptability assessment. For some applications, this might mean doing comparably to human-level performance. But this will depend on your application, and we'll talk more about comparing to human-level performance next week.
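To make the mixed-control idea concrete, here is a minimal numerical sketch, not part of the lecture, using the coefficients from the joystick example above. The target angle and speed, and the use of NumPy, are my own illustrative assumptions.

```python
import numpy as np

# Mixed controls: each column is the effect of turning one knob by one unit,
# rows are (steering angle, speed), with the coefficients from the joystick example.
mixed = np.array([[0.3, 2.0],
                  [-0.8, 0.9]])

# Orthogonal controls: knob 0 changes only the angle, knob 1 changes only the speed.
orthogonal = np.eye(2)

target = np.array([10.0, 30.0])  # the (angle, speed) you actually want

# With orthogonal controls the knob settings are just the targets themselves.
print("orthogonal knob settings:", np.linalg.solve(orthogonal, target))

# With mixed controls you must solve a coupled linear system; easy for a
# computer, but very hard to do by feel while driving.
print("mixed knob settings:", np.linalg.solve(mixed, target))
```

The point of the sketch is that the mixed controls are not unusable, they are just much harder to reason about, because reaching any one goal requires coordinating both knobs at once.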
After doing well on the training set, you then hope that this leads to also doing well on the dev set, and then that it also does well on the test set. And finally, you hope that doing well on the test set on the cost function results in your system performing well in the real world. So you hope that this results in happy cat picture app users, for example.

To relate back to the TV tuning example: if the picture on your TV was either too wide or too narrow, you'd want one knob to tune in order to adjust that. You don't want to have to carefully adjust five different knobs that also affect other things. You want one knob that just affects the width of your TV image.

In a similar way, if your algorithm is not fitting the training set well on the cost function, you want one knob (yes, that's my attempt to draw a knob), or maybe one specific set of knobs, that you can use to tune your algorithm to make it fit well on the training set. The knobs you use to tune this are: you might train a bigger network, or you might switch to a better optimization algorithm, like the Adam optimization algorithm, and some other options we'll discuss later this week and next week.

In contrast, if you find that the algorithm is not fitting the dev set well, then there's a separate set of knobs (yes, that's my not very artistic rendering of another knob); you want a distinct set of knobs to try. So, for example, if your algorithm is not doing well on the dev set (it's doing well on the training set but not on the dev set), then you have a set of knobs around regularization that you can use to try to make it satisfy the second criterion. The analogy is: now that you've tuned the width of your TV image, if the height of the image isn't quite right, you want a different knob to tune the height, and you want to do this hopefully without affecting the width of your TV image too much. Getting a bigger training set would be another knob you could use that helps your learning algorithm generalize better to the dev set. A rough sketch of how these first two sets of knobs show up in code follows below.

Now, having adjusted the width and height of your TV image, what if it doesn't meet the third criterion? What if you do well on the dev set but not on the test set?
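Here is a hedged PyTorch-style sketch of those first two knob sets; the layer sizes, learning rate, and weight decay value are arbitrary placeholders of mine, not settings from the lecture.

```python
import torch.nn as nn
import torch.optim as optim

# Knob set 1: make the algorithm fit the training set better.
model = nn.Sequential(            # "train a bigger network": add width or depth here
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
optimizer = optim.Adam(           # "switch to a better optimization algorithm"
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,            # knob set 2: L2 regularization, aimed at closing
)                                 # the gap between training set and dev set
loss_fn = nn.CrossEntropyLoss()

# Knob set 2 (continued): collecting a bigger training set is another control
# that mainly helps the model generalize from the training set to the dev set.
```

The design point is that each choice is labeled by which of the four criteria it is meant to move, so you can adjust one without reaching for the others.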
If that happens, then the knob you tune is that you probably want to get a bigger dev set. Because if it does well on the dev set but not on the test set, it probably means you've overtuned to your dev set, and you need to go back and get a bigger dev set.

And finally, if it does well on the test set but isn't delivering happy cat picture app users, then what that means is that you want to go back and change either the dev set or the cost function. Because if doing well on the test set according to some cost function doesn't correspond to your algorithm doing what you need it to do in the real world, then either your dev and test set distribution isn't set correctly, or your cost function isn't measuring the right thing.

I know I'm going over these examples quite quickly, but we'll go into much more detail on these specific knobs later this week and next week. So if you aren't following all the details right now, don't worry about it. But I want to give you a sense of this orthogonalization process: you want to be very clear about which of these maybe four issues the different things you could tune are trying to address.

And when I train a neural network, I tend not to use early stopping. It's not a bad technique; quite a lot of people do it. But I personally find early stopping difficult to think about, because it's one knob that simultaneously affects how well you fit the training set (if you stop early, you fit the training set less well) and, at the same time, is often used to improve your dev set performance. So this is one knob that is less orthogonalized, because it simultaneously affects two things. It's like a knob that simultaneously affects both the width and the height of your TV image. That doesn't mean it's bad; you can use it if you want. But when you have more orthogonalized controls, such as these other ones I'm writing down here, it just makes the process of tuning your network much easier.

So I hope that gives you a sense of what orthogonalization means. Just like when you look at the TV image, it's nice if you can say: my TV image is too wide, so I'm going to tune this knob; or it's too tall, so I'm going to tune that knob; or it's too trapezoidal, so I'm going to have to tune that knob.
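To summarize the mapping between symptoms and knobs described so far, here is a small hypothetical helper, a sketch only; the error thresholds, argument names, and the happy_users flag are all illustrative assumptions of mine, not anything from the lecture.

```python
def suggest_knobs(train_err, dev_err, test_err, margin=0.05, happy_users=True):
    """Map which of the four criteria fails to the knobs discussed above.

    The thresholds and the notion of "happy users" are illustrative placeholders.
    """
    if train_err > margin:
        return "Fit training set: bigger network, better optimizer (e.g. Adam)."
    if dev_err > train_err + margin:
        return "Fit dev set: regularization, or a bigger training set."
    if test_err > dev_err + margin:
        return "Fit test set: get a bigger dev set (you have overtuned to the dev set)."
    if not happy_users:
        return "Real world: change the dev/test set distribution or the cost function."
    return "All four criteria look satisfied."


# Example: does well on the training and dev sets, but not on the test set.
print(suggest_knobs(train_err=0.02, dev_err=0.03, test_err=0.15))
```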
In machine learning, it's nice if you can look at your system and say: this piece of it is wrong. It does not do well on the training set, it does not do well on the dev set, it does not do well on the test set, or it's doing well on the test set but just not in the real world. You figure out exactly what's wrong, and then you have exactly one knob, or a specific set of knobs, that helps solve just the problem that is limiting the performance of your machine learning system.

So what we're going to do this week and next week is go through how to diagnose what exactly is the bottleneck to your system's performance, as well as identify the specific set of knobs you could use to tune your system to improve that aspect of its performance. So let's start going into more of the details of this process.