In this video, we're going to look at the softmax output function. This is a way of forcing the outputs of a neural network to sum to one so they can represent a probability distribution across discrete, mutually exclusive alternatives.

Before we get back to the issue of how we learn feature vectors to represent words, we're going to have one more digression, this time a technical one. So far I've talked about using a squared error measure for training a neural net, and for linear neurons that's a sensible thing to do. But the squared error measure has some drawbacks. If, for example, the desired output is one, so you have a target of one, and the actual output of a neuron is one billionth, then there's almost no gradient to allow a logistic unit to change. It's way out on a plateau where the slope is almost exactly horizontal. And so it will take a very, very long time to change its weights, even though it's making almost as big an error as it's possible to make.

Also, if we're trying to assign probabilities to mutually exclusive class labels, we know that the outputs should sum to one. Any answer in which we say the probability that it's an A is three quarters and the probability that it's a B is also three quarters is just a crazy answer. We ought to tell the network that information; we shouldn't deprive it of the knowledge that these are mutually exclusive answers.

So the question is, is there a different cost function that will work better? Is there a way of telling it that these are mutually exclusive and then using an appropriate cost function? The answer, of course, is that there is. What we need to do is force the outputs of the neural net to represent a probability distribution across discrete alternatives, if that's what we plan to use them for. The way we do this is by using something called a softmax. It's a kind of soft, continuous version of the maximum function.

So the way the units in a softmax group work is that they each receive some total input they've accumulated from the layer below. That's z_i for the i-th unit, and it's called the logit. They then give an output y_i that doesn't just depend on their own z_i; it depends on the z's accumulated by their rivals as well. So we say that the output of the i-th neuron is e to the z_i divided by the sum of that same quantity over all the different neurons in the softmax group.
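Here is a minimal sketch of that equation in NumPy. The function name and the max-subtraction trick for numerical stability are my own additions, not part of the lecture; the formula itself is the one just described.

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits z: y_i = exp(z_i) / sum_j exp(z_j).

    Subtracting max(z) first doesn't change the result (it cancels in the
    ratio) but keeps exp() from overflowing for large logits.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Example: three rival units in one softmax group.
y = softmax([2.0, 1.0, 0.1])
print(y)        # approximately [0.659, 0.242, 0.099]
print(y.sum())  # 1.0 -- the outputs form a probability distribution
```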
And because the bottom line of that equation is the sum of the top line over all possibilities, we know that when you add over all possibilities you'll get one. That is, the sum of all the y_i's must come to one. What's more, each y_i has to lie between zero and one. So we force the y_i's to represent a probability distribution over mutually exclusive alternatives just by using that softmax equation.

The softmax equation has a nice, simple derivative. If you ask how y_i changes as you change z_i, that obviously involves the other z's, because y_i itself depends on all the other z's. But it turns out that you get a nice, simple form, just like you do for the logistic unit: the derivative of the output with respect to the input, for an individual neuron in a softmax group, is just y_i times one minus y_i. It's not totally trivial to derive that. If you try differentiating the equation above, you must remember that z_i also turns up in that normalization term on the bottom row. It's very easy to forget those terms and get the wrong answer.

Now the question is, if we're using a softmax group for the outputs, what's the right cost function? And the answer, as usual, is that the most appropriate cost function is the negative log probability of the correct answer. That is, we want to maximize the log probability of getting the answer right. So if one of the target values is a one and the remaining ones are zero, then we simply sum over all possible answers, putting zeros in front of all the wrong answers and a one in front of the right answer, and that gets us the negative log probability of the correct answer, as you can see in the equation. That's called the cross-entropy cost function.

It has a nice property: it has a very big gradient when the target value is one and the output is almost zero. You can see that by considering a couple of cases. An output value of one in a million is much better than a value of one in a billion, even though the two outputs differ by less than one millionth. So when you increase the output value by less than one millionth, the value of C improves by a lot. That means C has a very, very steep gradient.
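Since the derivation is easy to get wrong (the z_i in the denominator is the term people forget), here is the derivative worked out in the notation above, together with the cross-entropy cost just mentioned; the intermediate steps are my reconstruction, not shown in the lecture.

```latex
\[
  y_i = \frac{e^{z_i}}{\sum_k e^{z_k}},
  \qquad
  \frac{\partial y_i}{\partial z_j}
  = \frac{\delta_{ij}\, e^{z_i}\,\sum_k e^{z_k} \;-\; e^{z_i} e^{z_j}}
         {\bigl(\sum_k e^{z_k}\bigr)^{2}}
  = y_i\,(\delta_{ij} - y_j)
\]
% For a unit's own logit (j = i) this reduces to dy_i/dz_i = y_i (1 - y_i).
\[
  C = -\sum_j t_j \log y_j
\]
% With a one-hot target t, C is the negative log probability of the correct answer.
```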
One way of seeing why a value of one in a million is much better than a value of one in a billion, if the correct answer is one, is this: if you believed the one in a million, you'd be willing to bet at odds of a million to one, and you'd stand to lose a million dollars. If you thought the answer was one in a billion, you'd lose a billion dollars making the same bet.

So we get a nice property: the cost function C has a very steep derivative when the answer is very wrong, and that exactly balances the fact that the rate at which the output changes as you change the input, dy/dz, is very flat when the output is very wrong. When you multiply the two together to get the derivative of the cross-entropy with respect to the logit going into output unit i, you use the chain rule: that derivative is how fast the cost function changes as you change the output of a unit, times how fast the output of that unit changes as you change z_i. And notice we need to add up across all the j's, because when you change z_i, the outputs of all the different units change. The result is just the actual output minus the target output. You can see that when the actual and target outputs are very different, that has a slope of one or minus one, and the slope is never bigger than one or minus one. But the slope never gets small until the two things are pretty much the same; in other words, until you're getting pretty much the right answer.
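The sketch below checks that result numerically: the analytic gradient y_i minus t_i from the chain-rule argument above is compared against finite differences of the cross-entropy. The particular logits and target are made-up illustration values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, t):
    """C = -sum_j t_j * log(y_j), with y = softmax(z) and t a one-hot target."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])
t = np.array([1.0, 0.0, 0.0])   # the correct answer is class 0

# Analytic gradient from the chain rule: dC/dz_i = y_i - t_i
analytic = softmax(z) - t

# Numerical check by central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, t) - cross_entropy(zm, t)) / (2 * eps)

print(analytic)   # approximately [-0.823, 0.032, 0.791]
print(numeric)    # matches the analytic gradient to several decimal places
```

Note how the component for the correct class stays close to minus one while its output is near zero, which is exactly the steep-gradient behaviour described above.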