When training a neural network, one of the techniques that will speed up your training is normalizing your inputs. Let's see what that means. Let's say you have a training set with two input features, so the input features x are two-dimensional, and here's a scatter plot of your training set.

Normalizing your inputs corresponds to two steps. The first is to subtract out, or zero out, the mean. So you set mu = (1/m) * sum over i of x^(i). This is a vector, and then x gets set to x minus mu for every training example, which means you just move the training set until it has zero mean.

The second step is to normalize the variances. Notice here that the feature x1 has a much larger variance than the feature x2. So what we do is set sigma^2 = (1/m) * sum over i of x^(i)**2, where the squaring is element-wise. So sigma^2 is a vector with the variance of each of the features, and notice that we've already subtracted out the mean, so x^(i) squared element-wise is just the variance. Then you take each example and divide it element-wise by sigma, the vector of standard deviations. In pictures, you end up with this, where now the variances of x1 and x2 are both equal to one.

One tip: if you use this to scale your training data, then use the same mu and sigma^2 to normalize your test set. In particular, you don't want to normalize the training set and the test set differently. Whatever this value is, and whatever this value is, use them in these two formulas so that you scale your test set in exactly the same way, rather than estimating mu and sigma^2 separately on your training set and test set. You want your data, both training and test examples, to go through the same transformation defined by the same mu and sigma^2 calculated on your training data.
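As a concrete reference, here is a minimal NumPy sketch of these two steps. The sketch, including the array names X_train and X_test, the column-per-example (n_x, m) layout, and the small eps added for numerical safety, is an illustrative assumption rather than code from the lecture. It computes mu and sigma squared on the training data only, reuses them for the test data, and divides by the square root of sigma squared so each feature ends up with unit variance.

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Zero-center and variance-normalize input features.

    Assumes X_train and X_test are NumPy arrays of shape (n_x, m),
    with one example per column. Returns the normalized arrays plus
    mu and sigma2 so the same transformation can be reused later.
    """
    # Step 1: subtract out the mean (estimated on the training set only).
    mu = np.mean(X_train, axis=1, keepdims=True)            # shape (n_x, 1)
    X_train = X_train - mu

    # Step 2: normalize the variances. sigma2 holds the per-feature
    # variance of the already-centered data; dividing by its square
    # root gives each feature unit variance.
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # shape (n_x, 1)
    X_train = X_train / np.sqrt(sigma2 + eps)

    # Apply the SAME mu and sigma2 to the test set; never re-estimate
    # them on the test data.
    X_test = (X_test - mu) / np.sqrt(sigma2 + eps)

    return X_train, X_test, mu, sigma2
```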
So, why do we do this? Why do we want to normalize the input features? Recall that the cost function is defined as written on the top right. It turns out that if you use unnormalized input features, it's more likely that your cost function will look like this: a very squished-out bowl, a very elongated cost function, where the minimum you're trying to find is maybe way over there.

If your features are on very different scales, say the feature x1 ranges from 1 to 1,000 and the feature x2 ranges from 0 to 1, then it turns out that the range of values the parameters w1 and w2 take on will end up being very different. So maybe these axes should be w1 and w2, but I'll plot w and b; then your cost function can be a very elongated bowl like that. So if you plot the contours of this function, you get a very elongated function like that. Whereas if you normalize the features, then your cost function will on average look more symmetric.

If you're running gradient descent on a cost function like the one on the left, then you might have to use a very small learning rate, because if you're here, gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps with gradient descent rather than needing to oscillate around like in the picture on the left.

Of course, in practice w is a high-dimensional vector, so trying to plot this in 2D doesn't convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier to optimize when your features are all on similar scales: not from 1 to 1,000 and from 0 to 1, but mostly from minus one to one, or with roughly similar variances. That just makes your cost function J easier and faster to optimize.

In practice, if one feature, say x1, ranges from zero to one, x2 ranges from minus one to one, and x3 ranges from one to two, these are fairly similar ranges, so this will work just fine. It's when they're on dramatically different ranges, like one from 1 to 1,000 and another from 0 to 1, that it really hurts your optimization algorithm. But just setting all of them to zero mean and, say, variance one, like we did in the last slide, guarantees that all your features are on a similar scale, and that will usually help your learning algorithm run faster.

So, if your input features come from very different scales, maybe some features from 0 to 1 and some from 1 to 1,000, then it's important to normalize your features.
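To make this intuition a bit more concrete, here is a small, self-contained sketch of my own, not from the lecture, that runs plain gradient descent on a toy quadratic cost. The curvature values stand in for the elongated versus round contours: when one direction is 10,000 times more curved than the other, the learning rate must be tiny for gradient descent to remain stable, so reaching the minimum takes tens of thousands of steps, whereas on the round bowl a large learning rate converges in about ten steps.

```python
import numpy as np

def steps_to_converge(curvatures, lr, tol=1e-6, max_steps=100_000):
    # Plain gradient descent on the toy quadratic cost
    # J(w) = 0.5 * sum(curvatures * w**2); returns the number of
    # steps until J drops below tol (or max_steps if it never does).
    w = np.ones_like(curvatures, dtype=float)   # arbitrary starting point
    for step in range(1, max_steps + 1):
        grad = curvatures * w                   # dJ/dw for this quadratic
        w = w - lr * grad
        if 0.5 * np.sum(curvatures * w ** 2) < tol:
            return step
    return max_steps

# Elongated bowl: one direction is 10,000x more curved than the other,
# roughly what features ranging 1..100 vs 0..1 would produce. Stability
# caps the learning rate near 2/10,000, so the steep direction bounces
# back and forth while the flat direction crawls toward the minimum.
print(steps_to_converge(np.array([1e4, 1.0]), lr=1.5e-4))   # ~44,000 steps

# Round bowl, as after normalization: a large learning rate is stable
# and gradient descent heads almost straight to the minimum.
print(steps_to_converge(np.array([1.0, 1.0]), lr=0.5))      # ~10 steps
```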
If your features come in on similar scales, then this step is less important, although performing this type of normalization pretty much never does any harm, so I'll often do it anyway if I'm not sure whether or not it will help speed up training for your algorithm.

So that's it for normalizing your input features. Next, let's keep talking about ways to speed up the training of your neural network.