So now, we'll go beyond least squares. However, let me mention that the main technique we cover this week is least squares, which has already been covered. What we will follow is essentially a broad overview of a variety of different areas dealing with different types of prediction models and their relationships, which are increasingly becoming apparent in the research community.

Let's first see what happens if we have categorical data. That is, the target variables that we are trying to predict, the y's, are not numbers but a yes or a no, which is essentially the learning problem that we had earlier. Well, the challenge here is that the x's are still numbers. So, it might not be such a great idea to use something like a naive Bayes classifier, because one would have to discretize those numbers into categorical variables, which might not be appropriate.

So, one tends to try to use regression techniques. But using a linear regression when all one is trying to do is separate out the no's, which are the reds, from the blues, which may be yes's, is quite brittle, because the points right near the line might get classified either way, and the distinctions one has to make are very small and subject to a lot of error. The other problem, of course, is that there may be no such line separating the data, and we'll come to that in a minute.

But just to make regression work for categorical data, the most common technique being used today is logistic regression, which essentially replaces f transpose x by a function of the form F(x) = 1 - 1/(1 + e^(-f^T x)).

Let's see how this function behaves. Suppose f transpose x is a very large positive value. Then e to the minus f transpose x is close to zero, so the second term is one over roughly one, and one minus that is almost zero: F(x) is close to zero. On the other hand, as f transpose x moves away from zero towards negative values, e to the minus f transpose x rapidly increases towards infinity, making the second term almost zero. So, F(x) goes close to one.
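To see this behaviour numerically, here is a minimal sketch (my own illustration, not code from the lecture) of the function just described; the weights and inputs are made up purely to show what happens at the extremes.

```python
import numpy as np

def logistic_output(f, x):
    """The function just described: F(x) = 1 - 1 / (1 + exp(-f.x))."""
    z = np.dot(f, x)                      # f transpose x
    return 1.0 - 1.0 / (1.0 + np.exp(-z))

# Made-up weights and inputs, purely to show the behaviour at the extremes.
f = np.array([2.0, -1.0])
for x in [np.array([ 5.0, 0.0]),   # f.x = 10 (large, positive)  -> output near 0
          np.array([ 0.0, 0.0]),   # f.x = 0                     -> output 0.5
          np.array([-5.0, 0.0])]:  # f.x = -10 (large, negative) -> output near 1
    print(f"f.x = {np.dot(f, x):6.1f}   F(x) = {logistic_output(f, x):.5f}")
```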
So, the speed at which that second term deviates from one, as f transpose x moves away from zero, is what makes logistic regression particularly attractive for separating out the yes's from the no's. Of course, trying to fit such a function is not as easy as linear least squares, but I'll show how that can be done very soon.

The other problem with linear regression is that the data might not be separable using a line. So, for example, you might have a parabolic relationship, as we described earlier. To support more complex functions, and when one doesn't even know what kind of function to use, the technique called support vector machines has become very popular. There, the kind of function f is parameterized by kernel parameters, and these parameters are also learned along with the function itself. So, not only do we learn the function, but we also learn the type of function from the data. And so, if it's a parabola, you learn a parabola. If it's something more complicated, like a circle with a hole inside, rather like a doughnut, the support vector machine will learn such functions as well. The point I'm trying to make is that, starting from linear least squares, one increases the complexity of the functions one is trying to learn, and we arrive at other types of regression and at support vector machines.

Way back in the early days of AI, one found many efforts to mimic the brain, which essentially resulted in the field of neural networks, where one tried to create structures that look something like the way neurons in the brain affect each other, based on their connections to other neurons. The first neural networks, created in the 50s by McCulloch, Pitts, Rosenblatt and others, essentially looked like linear combinations of inputs. So, they were pretty much like least squares, in the sense that you would have a neural network which would say that these are neurons, and the input to this neuron is the rainfall, this one's the temperature, this one's the harvest rainfall, and then we have another one whose input is just the constant one. The output would be the wine quality, and the weights, that is, the connections between these neurons, would be learned by a process of iterative refinement, which could essentially be looked upon as a way of solving the least squares problem. To deal with exactly the same problems of classification, we found logistic functions being included in neural networks.
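To make that early picture concrete, here is a minimal sketch in Python, again my own illustration rather than code from the lecture, of such a single-layer network: one output neuron connected to rainfall, temperature and harvest-rainfall inputs plus the constant-one input, with its weights learned by iterative refinement, that is, plain gradient descent on the squared error. The data and the "true" weights are made up, and the last line simply checks the result against a direct least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical, standardised training data (mean 0, variance 1) for the three
# inputs mentioned in the lecture, plus the extra input that is just the constant one.
rain, temp, harvest = rng.standard_normal((3, n))
X = np.column_stack([rain, temp, harvest, np.ones(n)])

true_w = np.array([0.1, 0.6, -0.4, 3.0])          # made-up "true" weights
y = X @ true_w + 0.05 * rng.standard_normal(n)    # wine quality, with some noise

# Iterative refinement of the connection weights: plain gradient descent on the
# squared error, which amounts to solving the same least-squares problem.
w = np.zeros(4)
lr = 0.1
for step in range(1000):
    err = X @ w - y                  # prediction error on all examples
    w -= lr * (X.T @ err) / n        # move the weights against the error gradient

print("iteratively refined weights:", np.round(w, 2))
print("direct least-squares solve :", np.round(np.linalg.lstsq(X, y, rcond=None)[0], 2))
```

Passing the same neuron's output through the logistic function described earlier is what turns this regressor into a classifier.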
So essentially, this field was evolving in parallel with statistical prediction techniques. Interest in neural networks sort of waned over the years. Along the way, of course, we figured out different types of neural networks which were multi-layer, and essentially, by combining logistic functions, linear functions, and other types of functions, one could learn more complex non-linear functions of the input variables. Hidden layers were introduced, which essentially create more complicated forms of f, parameterized by the weights, which would then get learned through a process of optimization, as we'll see shortly.

Finally, in recent years, in spite of the fact that interest in neural networks had waned over the years, these have become interesting once again with the notion of feedback. Notice that in the networks shown here, all the links go forward from the input nodes to the output nodes. So, they are feed-forward, even though they're multi-layer. But when we start having feedback, that is, when the output of a node in the middle is fed back into previous nodes, this starts behaving like a Bayesian belief network. In particular, feedback neural networks display the same kind of explaining-away effect which we found in Bayesian networks: once you figure out that a particular cause is more likely, other causes which might also contribute to the same observation become less likely. So, that makes these much more interesting than the older neural networks and has sparked much recent interest in this field. The fields of deep belief networks, multi-layer feedback neural networks, and temporal neural networks are all coming together, and deep relationships between them are being explored. We'll come to these points soon.

But for the moment, let's try to understand: if one has such a complex f, how would one learn the parameters of f?
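Before turning to that question, here is a small sketch, again my own and with made-up weights, of what such a complex f looks like for a feed-forward network with one hidden layer of logistic units and a linear output. The weight matrices W_hidden and w_out are exactly the parameters we now have to ask how to learn.

```python
import numpy as np

def logistic(z):
    # The same logistic squashing function used earlier.
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W_hidden, w_out):
    """One hidden layer of logistic units followed by a linear output unit.

    The whole thing is just a more complicated form of f, parameterised by
    the entries of W_hidden and w_out.
    """
    h = logistic(W_hidden @ x)       # hidden-layer activations
    return w_out @ h                 # linear combination of the hidden outputs

# Made-up weights and input, purely to show the shape of the computation.
x = np.array([0.4, -1.2, 0.7])           # three input nodes
W_hidden = np.array([[ 0.5, -0.3, 0.8],  # weights into two hidden neurons
                     [-0.7,  0.2, 0.1]])
w_out = np.array([1.5, -2.0])            # weights into the single output node

print(feed_forward(x, W_hidden, w_out))
```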