Hi. In this video, we'll discuss linear models, one of the simplest models in machine learning. Linear models are the building blocks for the deep neural networks we will discuss in this course, so they are quite important for us. Let's start with an example.

Suppose you are given an image and the goal is to count the sea lions in it. This is a real-world problem that was hosted on kaggle.com. We want to write a program, a function a, that takes an image as input and outputs the number of sea lions in it. Of course, we could come up with heuristics, like detecting the edges of the objects in the photograph and counting connected components, but this approach is inferior to machine learning. In machine learning, we collect a labelled set of images: we gather, say, 1,000 or maybe even 1 million such photographs and label them. We count the sea lions, so we have a ground truth for every image, and then we try to learn a function from the data, a function a that fits this data best.

Let's give some definitions that will be very useful for us. An image, or any other object that we analyze in machine learning, is called an example, and if it's an example we train the model on, it's a training example. We describe each example with d characteristics that we call features; for images, the features might be the intensities of every pixel, or something else. So we have examples, and in supervised learning we also have target values, a ground-truth answer for each example. In the problem of counting sea lions, the target value is the number of sea lions for every example, for every image. We denote the target values by y, so for an example x_i, the target value is y_i. As I said, in machine learning we collect a set of labelled examples. We denote it by X; it is a set of ℓ pairs, each consisting of an example's feature description and its target value. Finally, we want to find a model, a function that maps examples to target values. We denote the model, or hypothesis, by a(x), and the goal of machine learning is to find a model that fits the training set X best.
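To make the notation concrete, here is a minimal sketch, in NumPy, of how such a labelled training set might be stored; all the numbers are invented for illustration.

```python
import numpy as np

# A toy labelled training set: l = 4 examples, each described by d = 2 features.
# Row i of X_train is the feature description of example x_i,
# and y_train[i] is its target value y_i (e.g., a sea lion count).
X_train = np.array([[0.5, 1.2],
                    [1.0, 0.7],
                    [1.5, 2.3],
                    [2.0, 1.8]])
y_train = np.array([3.0, 2.0, 6.0, 5.0])

i = 2
print("features of example x_i:", X_train[i])  # its feature description
print("target value y_i:", y_train[i])         # its ground-truth answer
```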
There are two main classes of supervised learning problems: regression and classification. In regression, the target value is a real number. For example, if we count sea lions, the target value is numeric; strictly speaking, it's a natural number, but the task is still regression. Or, for example, if we're given a job description and try to predict what salary will be offered for that job, that's also regression, since salary is a real value. Or if we're given a movie review from some user and try to determine what rating the user will give the movie on a scale from one to five, that too can be solved as a regression problem. On the other hand, if the number of possible target values is finite, it's a classification task. For example, if we want to recognize objects in images, say, to find out whether there are cats, dogs, grass, clouds, or a bicycle in the image, that's an object recognition task; since the number of answers is finite, there is a finite number of object classes, we are solving a classification task. Or, for example, if we are analyzing news articles and want to find out what topic an article belongs to, whether it's about politics, sports, or entertainment, that's also a classification task, since the number of target values is, once again, finite.

Let's look at a very simple dataset. Each object, each example, is described by one feature, and we have a real-valued target. Here is the dataset; we can see that there is a linear trend: if the feature increases two times, the target decreases by about two times as well. So maybe we could use a linear model to describe this data, to build a predictive model. Here's a linear model: a(x) = w1*x + w0. It's very simple and has just two parameters, w1 and w0. If we find the best weights w1 and w0, we'll have a model like this one. It describes the data very well. It isn't perfect, it doesn't predict the exact target value for each example, but it fits the data quite well.
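As a small sketch of this one-feature case, here is how one might fit w1 and w0 with a least-squares line in NumPy; the data points below are invented for illustration.

```python
import numpy as np

# Invented one-feature dataset with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of a degree-1 polynomial; returns [w1, w0].
w1, w0 = np.polyfit(x, y, deg=1)

predictions = w1 * x + w0  # the linear model a(x) = w1*x + w0
print("w1 =", w1, ", w0 =", w0)
print("predictions:", predictions)
```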
Of course, in most machine learning tasks there are many features, so we use a generic linear model like this one: it takes each feature x_j, multiplies it by its weight w_j, sums these products over all the features, and then adds a bias term b:

a(x) = b + w_1*x_1 + ... + w_d*x_d

This is a linear model. It has d + 1 parameters, where d is the number of features in our dataset: there are d weights, or coefficients, and one bias term b. It's a very simple model; neural networks, for example, have many more parameters for the same number of features. And to make it even simpler, we'll suppose that every example has a fake feature that always equals one, so that the coefficient of this feature plays the role of the bias. In the following slides, we won't analyze the bias separately; we'll assume it is among the weights.

It is very convenient to write our linear model in vector form. It's known from linear algebra that a dot product is exactly what's written on the previous slide: we multiply the two vectors component by component and sum the results. So our linear model is basically the dot product of the weight vector w and the feature vector x. And if we want to apply our model to the whole training set, or to some other set of examples, we do the following. We form a matrix X from our sample: it has ℓ rows and d columns, where each row corresponds to one example and each column corresponds to the values of one feature across all examples. Then, to apply our model to this set X, we multiply the matrix X by the vector w, and that gives us our predictions. This multiplication yields a vector of size ℓ, where each component is the prediction of our linear model for one example.
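Here is a minimal NumPy sketch of these two ideas: folding the bias into the weights via the constant feature, and predicting for a whole sample with a single matrix-vector product. The numbers are made up for illustration.

```python
import numpy as np

# Sample matrix: l = 3 examples (rows), d = 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Append the "fake" feature that always equals one,
# so the bias becomes just another weight.
X = np.hstack([X, np.ones((X.shape[0], 1))])

w = np.array([0.5, -0.2, 1.0])  # d weights plus the bias as the last component

# One example: the prediction is the dot product of w and x_i.
print(X[0] @ w)

# Whole sample: X @ w is a vector of l predictions, one per example.
print(X @ w)
```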
The next question in machine learning is how to measure the quality, or the error, of a model on some set, the training set or maybe the test set. One of the most popular choices of loss function in regression is mean squared error. It goes like this. We take a particular example x_i and calculate the prediction of our model for it; for the linear model, that's the dot product of w and x_i. Then we subtract the target value, so we get the deviation of the predicted value from the target value, take its square, and average these squared deviations over the whole training set:

L(w) = (1/ℓ) * sum_i (w·x_i − y_i)²

This is mean squared error. It measures how well our model fits the data: the smaller the mean squared error, the better the model fits. And of course, we can write mean squared error in vector form: we multiply the matrix X by the vector w to get the vector of predictions for all examples in the set, subtract the vector y of target values, of real answers, and take the squared Euclidean norm of this vector, divided by ℓ:

L(w) = (1/ℓ) * ||Xw − y||²

That is the same mean squared error I described before.

So we have a loss function that measures how well our model fits the data; then all we have to do is minimize it with respect to w, our parameters. We want to find the parameter vector w that gives the smallest mean squared error on our training set. This is the essence of machine learning: we optimize a loss to find the best model.

Actually, if you do some calculus, if you take derivatives and solve the equations, you get an analytical solution to this optimization problem:

w = (XᵀX)⁻¹ Xᵀ y

But it involves inverting a matrix, which is a very expensive operation; if you have more than 100 or 1,000 features, it's very hard to invert the matrix XᵀX. We can instead reduce this problem to solving a system of linear equations, but that's still quite hard and requires a lot of computational resources (there is a small code sketch of both the loss and this solution at the end of the transcript). So later, we'll look for a framework for better, more scalable optimization of such problems.

In this video, we discussed linear models for regression. They are very simple, but they are very useful building blocks for deep neural networks. We discussed mean squared error, a loss function for regression problems, and found out that the problem has an analytical solution, but it scales poorly and is hard to compute. So in the following videos, we'll try to find a better way to optimize such models. But first of all, in the next video, we'll discuss how to apply linear models to classification tasks.
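As promised above, here is a minimal NumPy sketch of the mean squared error and the normal-equation solution; the data is randomly generated for illustration, and the linear system XᵀXw = Xᵀy is solved directly instead of computing an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 100, 3
X = rng.normal(size=(l, d))                # sample matrix: l examples, d features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=l)  # noisy target values

def mse(w, X, y):
    # Mean squared error: (1/l) * ||Xw - y||^2.
    return np.mean((X @ w - y) ** 2)

# Analytical solution of the normal equations X^T X w = X^T y.
# Solving the system is cheaper and more stable than inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("recovered weights:", w_hat)
print("train MSE:", mse(w_hat, X, y))
```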