Whatever our model, one needs to minimize the error between f of x and y, that is, the error between the real values you already have and the values the model predicts for them. Well, f is complicated; there is no closed-form formula like the normal equations we had for linear least squares. So all of these methods, whether they are regression with a logistic function, support vector machines, or neural networks, learn their parameters through an iterative process.

We start with some f, which is really a parameterization of the function f. It may not be linear like f transposed x; it could be some other parameterization: you could have squares, you could have exponentials, all kinds of things. But you have some parameters which tell you, for each choice of parameters, what the function f is. And we need to refine it iteratively to find the f which minimizes the error. In the world of neural networks, the first such search algorithm was called back-propagation, which has nothing to do with feedback in the network; it is just an algorithm for learning the parameters. In other areas, which were perhaps more mathematically grounded, straightforward optimization techniques were used to find the minimum of this error.

So we start off with a guess, and we choose a slightly different f as a second guess. One way to start reducing this error, or moving towards a value of f which is likely to have lower error, is to use something called gradient descent. The formula looks a bit daunting, so let me explain it in simple language. Epsilon of f is a function of the parameters of f. So one can measure how fast this function changes when one changes each component of f separately, which is nothing but taking the partial derivative of the function with respect to that component. String all those partial derivatives together and you get a vector, which is just this del sub f of epsilon, the gradient. That vector tells you in which direction to move each component of f so as to reduce epsilon. So, for example, if the first component of f is such that increasing it decreases epsilon, then you would choose to increase that component of f. If the second component is one where epsilon decreases only when you decrease that component, then that particular element of del would be negative.

So you get this vector del, computed at your current value of f, that is, at your previous iterate, starting with f0, for example, and then you add some small multiple of it to your current guess to create a new guess. Now, you still need to compute these derivatives, so you use the difference between epsilon at f i and epsilon at f i minus one, divided component-wise by f i minus f i minus one, to get estimates for the derivatives. That is why you needed two guesses to start with: you have to bootstrap the iteration. This is a very popular and very simple method. There are more sophisticated ways of finding the minimum of epsilon, such as Newton's iteration, where you use second derivatives in addition to first derivatives, and various other techniques from the mathematics literature are available as well.
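To make the shape of that iteration concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from the lecture: the helper names, the step size, and the toy error function are all invented. The derivatives are estimated numerically by nudging one component at a time (a per-component variant of the two-guess estimate described above), and a small multiple of the descent direction is added to the current guess at every step.

```python
import numpy as np

def estimate_gradient(epsilon, f, h=1e-6):
    """Approximate each partial derivative of the error epsilon at f
    with a small one-sided finite difference."""
    grad = np.zeros_like(f)
    for j in range(len(f)):
        f_step = f.copy()
        f_step[j] += h                                # change only component j
        grad[j] = (epsilon(f_step) - epsilon(f)) / h
    return grad

def gradient_descent(epsilon, f0, step=0.1, iters=500):
    """Repeatedly move the parameter vector a small amount downhill."""
    f = np.array(f0, dtype=float)
    for _ in range(iters):
        f = f - step * estimate_gradient(epsilon, f)  # new guess from current guess
    return f

# Toy error: epsilon(f) = (f1 - 3)^2 + (f2 + 1)^2, minimized at (3, -1)
epsilon = lambda f: (f[0] - 3.0) ** 2 + (f[1] + 1.0) ** 2
print(gradient_descent(epsilon, f0=[0.0, 0.0]))       # close to [3, -1]
```

In practice you would rarely hand-roll this; the point is only to show the loop: estimate the derivatives, step against them, repeat.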
This works fine if you have numbers, that is, if the x's are actual numerical values as we have been assuming, but there are, of course, some caveats. The iteration really works in all situations only if there is a single minimum, or if you happen to start very close to the right one. If there are many different minima, you might get stuck in a local minimum and never reach the true minimum value, so you need better techniques.

The other complication is that the function f might not be parameterized by a single unconstrained set of values; there may be constraints. You might have a function made up of five different lines which have to meet at their connecting points, and you are trying to fit those five lines to all your data. Then you have constraints, which makes the situation more complicated. But we won't go into those details right now.

The other problem is if x is not numeric. If x is categorical, or at least some of the x's are categories like yes/no or red/blue/green, then you somehow have to convert them to a binary form. So, if there are n categorical variables, you might convert each category to a zero-or-one indicator. Of course, now you are going to have many more than n variables. Take words, for example: you have categorical variables which are words, and you suddenly blow up the dimension of the space. You are talking about one word, but that word could come from a vocabulary of millions, so now you have a million-dimensional space with a zero or one in each position depending on whether that word is present. That tremendously complicates such methods, and they often become unviable when your categorical data is very high-dimensional; you have to use different techniques to learn the parameters.
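Here is what that zero-or-one conversion (commonly called one-hot encoding) looks like concretely. The colour list, the word vocabulary, and the helper name are all just an illustration, not anything from the lecture:

```python
def one_hot(value, vocabulary):
    """Turn a single categorical value into a 0/1 indicator vector,
    one position per possible category."""
    return [1 if value == v else 0 for v in vocabulary]

colours = ["red", "blue", "green"]
print(one_hot("blue", colours))    # [0, 1, 0]

# With words the same idea explodes in dimension: one position per word
# in the whole vocabulary, almost all of them zero for any given record.
vocabulary = ["the", "cat", "sat", "mat"]   # imagine millions of entries instead
print(one_hot("cat", vocabulary))  # [0, 1, 0, 0]
```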
You could also convert these categorical variables to real numbers through a process of reverse classification, which we won't talk about here, but essentially you treat categories like high, low, and medium not as fixed labels but as degrees of membership, so something can be 50 percent high, 20 percent low, and so on. You then use such fuzzy memberships to convert your categorical variables into real numbers, and once you have real numbers, you can use the technique above.

Or you can work in the space of categorical data itself. There are techniques like neighborhood search, heuristic AI search, genetic algorithms, and so on, which do not rely on converting your parameters into a real-valued space. But then you run into the problem of combinatorial explosion, and you have to use all kinds of techniques to find good ways of minimizing the error.

And lastly, as we have seen in previous lectures, probabilistic models can deal with the probabilities of variables taking on particular categorical values directly, as opposed to having to convert them to numbers first. These techniques have become increasingly powerful and popular precisely because they let you work with categorical data without running into too much exponential combinatorial blowup.

Now, that is quite a mouthful, and many of you might not have got all of it. But those of you who are on the verge of graduate research might want to listen to that spiel a couple of times, because it does give a bigger picture of how learning parameters in predictive models works and what the possible options are.
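Purely as an illustration of the fuzzy-membership idea mentioned above, and not the procedure the lecture has in mind (the labels, numbers, and encoding here are all invented): a categorical reading can be replaced by a small vector of membership degrees, which is already a real-valued representation that the earlier machinery can work with.

```python
# Hypothetical fuzzy memberships for a temperature category: each label is
# mapped to degrees of being low / medium / high rather than a hard choice.
memberships = {
    "cool": {"low": 0.7, "medium": 0.3, "high": 0.0},
    "warm": {"low": 0.1, "medium": 0.6, "high": 0.3},
    "hot":  {"low": 0.0, "medium": 0.2, "high": 0.8},
}

def to_real_features(label):
    """Replace a categorical label with its real-valued membership degrees."""
    m = memberships[label]
    return [m["low"], m["medium"], m["high"]]

print(to_real_features("warm"))   # [0.1, 0.6, 0.3]
```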
There are some more linkages which I'd like to explore. For example, here we are trying to minimize the error in a predictive model. In other situations, we might need to maximize an objective instead: we may want to find the best way to transport goods across the country, or the best way to allocate one's advertising funds across different channels. There you are trying to maximize some function, and that is closely related to the problem of minimizing a prediction error. Similar techniques apply, gradient descent, Newton's method, and various others, and all the same issues of categorical data, local minima, and constraints apply as well.

Then, once one has decided where one wants to go, one needs to figure out how to get there. Typically, if you are driving a car, for example, you have a system, and your steering angle could be, say, theta. Every time you change your theta, the car moves in a certain direction, so you have a new state of the system. The goal is to figure out which sequence of control actions will ultimately minimize the difference between the state of your system and where you want that state to be. There could be other constraints too, for instance that you want the path to be fairly smooth, so you get all those problems of constraints again. Again, it is a minimization problem. But here, unlike in the first two cases of minimizing a prediction error or optimizing a function, each iteration is actually executed, and what you are dealing with is the response of a real system rather than an abstract function. An actual control system makes a change, sees what the output is, and makes another change; once more, techniques like gradient descent, Newton's method, and so on start applying, but now you are controlling actions.

So, from predicting the future, or predicting where things are going to go, to figuring out how best to use that information to optimize one's own goals, and then deciding how to change one's behavior or one's system so as to get to the goal state, all of these are tightly related. And I hope you have seen that the mathematics is related. Now let's talk about some of the applications.
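Before the applications, here is a toy sketch of that execute-and-observe loop. The plant model, the gains, and the numbers are all invented for illustration, and this is a simple proportional-style correction rather than anything the lecture prescribes: on each iteration a control action (the steering angle) is chosen from the currently observed error, the action is applied to the system, and the new state is observed.

```python
def simulate_step(heading, theta, gain=0.5):
    """Hypothetical system response: one step changes the heading by gain * theta."""
    return heading + gain * theta

target = 90.0     # where we want the state (heading) to end up
heading = 0.0     # observed state of the system
K = 0.5           # how strongly the observed error drives the next correction

for _ in range(50):
    error = heading - target
    theta = -K * error                        # pick the next action from the observed error
    heading = simulate_step(heading, theta)   # execute it and observe the system's response

print(round(heading, 3))                      # settles close to 90.0
```

Real controllers, or a gradient-descent formulation over a whole sequence of actions, are of course more elaborate; the point is only that each step acts on the actual system and then corrects.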
Before we do that, of course: in practice, whether you are dealing with support vector machines, logistic regression, neural networks, or what have you, you won't ever really have to implement these things yourself. There are packages available that let you apply any of these techniques without having to get into the math or the details of the iteration. The important thing, of course, is choosing which package and which technique to use. As we go through the applications, I will try to summarize, from my experience, which packages and what kinds of techniques to use. And then we will move on to further topics, such as deep belief networks.