Hi. In this video, we'll discuss linear models, one of the simplest models in machine learning. Linear models are the building blocks for the deep neural networks we will discuss in this course, so they are quite important for us. Let's start with an example.

Suppose you are given an image and the goal is to count the sea lions in it. This is a real-world problem that was hosted on kaggle.com. We want to write a program, a function a, that takes an image as input and outputs the number of sea lions in it. Of course, we could come up with heuristics, like detecting the edges of the objects in the photograph and counting connected components, but this approach is inferior to machine learning. In machine learning, we collect a labelled set of images: we gather, say, 1,000 or maybe even 1 million such photographs and label them. We count the sea lions, so we have a ground truth for every image, and then we try to learn a function from the data, a function a that fits this data best.

Let's give some definitions that will be very useful for us. An image, or any other object that we analyze in machine learning, is called an example, and if it's an example we train the model on, it's a training example. We describe each example with d characteristics that we call features; for images, the features might be the intensities of every pixel, or something else. So we have examples, and in supervised learning we also have target values, a ground-truth answer for each example. In the problem of counting sea lions, the target value is the number of sea lions for every example, for every image. We denote the target values by y, so for an example x_i, the target value is y_i. As I said, in machine learning we collect a set of labelled examples. We denote it by X; it is a set of ℓ pairs, each consisting of an example's feature description and its target value. Finally, we want to find a model, a function that maps examples to target values. We denote the model, or hypothesis, by a(x), and the goal of machine learning is to find a model that fits the training set X best.
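To make the notation concrete, here is a minimal sketch, in NumPy, of how such a labelled training set might be stored; all the numbers are invented for illustration.

```python
import numpy as np

# A toy labelled training set: l = 4 examples, each described by d = 2 features.
# Row i of X_train is the feature description of example x_i,
# and y_train[i] is its target value y_i (e.g., a sea lion count).
X_train = np.array([[0.5, 1.2],
                    [1.0, 0.7],
                    [1.5, 2.3],
                    [2.0, 1.8]])
y_train = np.array([3.0, 2.0, 6.0, 5.0])

i = 2
print("features of example x_i:", X_train[i])  # its feature description
print("target value y_i:", y_train[i])         # its ground-truth answer
```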
There are two main classes of supervised learning problems: regression and classification. In regression, the target value is a real number. For example, if we count sea lions, the target value is numeric; strictly speaking, it's a natural number, but the task is still regression. Or, for example, if we're given a job description and try to predict what salary will be offered for that job, that's also regression, since salary is a real value. Or if we're given a movie review from some user and try to determine what rating the user will give the movie on a scale from one to five, that too can be solved as a regression problem. On the other hand, if the number of possible target values is finite, it's a classification task. For example, if we want to recognize objects in images, say, to find out whether there are cats, dogs, grass, clouds, or a bicycle in the image, that's an object recognition task; since the number of answers is finite, there is a finite number of object classes, we are solving a classification task. Or, for example, if we are analyzing news articles and want to find out what topic an article belongs to, whether it's about politics, sports, or entertainment, that's also a classification task, since the number of target values is, once again, finite.

Let's look at a very simple dataset. Each object, each example, is described by one feature, and we have a real-valued target. Here is the dataset; we can see that there is a linear trend: if the feature increases two times, the target decreases by about two times as well. So maybe we could use a linear model to describe this data, to build a predictive model. Here's a linear model: a(x) = w1*x + w0. It's very simple and has just two parameters, w1 and w0. If we find the best weights w1 and w0, we'll have a model like this one. It describes the data very well. It isn't perfect, it doesn't predict the exact target value for each example, but it fits the data quite well.
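As a small sketch of this one-feature case, here is how one might fit w1 and w0 with a least-squares line in NumPy; the data points below are invented for illustration.

```python
import numpy as np

# Invented one-feature dataset with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of a degree-1 polynomial; returns [w1, w0].
w1, w0 = np.polyfit(x, y, deg=1)

predictions = w1 * x + w0  # the linear model a(x) = w1*x + w0
print("w1 =", w1, ", w0 =", w0)
print("predictions:", predictions)
```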
Of course, in most machine learning tasks there are many features, so we use a generic linear model like this one: it takes each feature x_j, multiplies it by its weight w_j, sums these products over all the features, and then adds a bias term b:

a(x) = b + w_1*x_1 + ... + w_d*x_d

This is a linear model. It has d + 1 parameters, where d is the number of features in our dataset: there are d weights, or coefficients, and one bias term b. It's a very simple model; neural networks, for example, have many more parameters for the same number of features. And to make it even simpler, we'll suppose that every example has a fake feature that always equals one, so that the coefficient of this feature plays the role of the bias. In the following slides, we won't analyze the bias separately; we'll assume it is among the weights.

It is very convenient to write our linear model in vector form. It's known from linear algebra that a dot product is exactly what's written on the previous slide: we multiply the two vectors component by component and sum the results. So our linear model is basically the dot product of the weight vector w and the feature vector x. And if we want to apply our model to the whole training set, or to some other set of examples, we do the following. We form a matrix X from our sample: it has ℓ rows and d columns, where each row corresponds to one example and each column corresponds to the values of one feature across all examples. Then, to apply our model to this set X, we multiply the matrix X by the vector w, and that gives us our predictions. This multiplication yields a vector of size ℓ, where each component is the prediction of our linear model for one example.
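Here is a minimal NumPy sketch of these two ideas: folding the bias into the weights via the constant feature, and predicting for a whole sample with a single matrix-vector product. The numbers are made up for illustration.

```python
import numpy as np

# Sample matrix: l = 3 examples (rows), d = 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Append the "fake" feature that always equals one,
# so the bias becomes just another weight.
X = np.hstack([X, np.ones((X.shape[0], 1))])

w = np.array([0.5, -0.2, 1.0])  # d weights plus the bias as the last component

# One example: the prediction is the dot product of w and x_i.
print(X[0] @ w)

# Whole sample: X @ w is a vector of l predictions, one per example.
print(X @ w)
```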
The next question in machine learning is how to measure the quality, or the error, of a model on some set, the training set or maybe the test set. One of the most popular choices of loss function in regression is mean squared error. It goes like this. We take a particular example x_i and calculate the prediction of our model for it; for the linear model, that's the dot product of w and x_i. Then we subtract the target value, so we get the deviation of the predicted value from the target value, take its square, and average these squared deviations over the whole training set:

L(w) = (1/ℓ) * sum_i (w·x_i − y_i)²

This is mean squared error. It measures how well our model fits the data: the smaller the mean squared error, the better the model fits. And of course, we can write mean squared error in vector form: we multiply the matrix X by the vector w to get the vector of predictions for all examples in the set, subtract the vector y of target values, of real answers, and take the squared Euclidean norm of this vector, divided by ℓ:

L(w) = (1/ℓ) * ||Xw − y||²

That is the same mean squared error I described before.

So we have a loss function that measures how well our model fits the data; then all we have to do is minimize it with respect to w, our parameters. We want to find the parameter vector w that gives the smallest mean squared error on our training set. This is the essence of machine learning: we optimize a loss to find the best model.

Actually, if you do some calculus, if you take derivatives and solve the equations, you get an analytical solution to this optimization problem:

w = (XᵀX)⁻¹ Xᵀ y

But it involves inverting a matrix, which is a very expensive operation; if you have more than 100 or 1,000 features, it's very hard to invert the matrix XᵀX. We can instead reduce this problem to solving a system of linear equations, but that's still quite hard and requires a lot of computational resources (there is a small code sketch of both the loss and this solution at the end of the transcript). So later, we'll look for a framework for better, more scalable optimization of such problems.

In this video, we discussed linear models for regression. They are very simple, but they are very useful building blocks for deep neural networks. We discussed mean squared error, a loss function for regression problems, and found out that the problem has an analytical solution, but it scales poorly and is hard to compute. So in the following videos, we'll try to find a better way to optimize such models. But first of all, in the next video, we'll discuss how to apply linear models to classification tasks.
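As promised above, here is a minimal NumPy sketch of the mean squared error and the normal-equation solution; the data is randomly generated for illustration, and the linear system XᵀXw = Xᵀy is solved directly instead of computing an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 100, 3
X = rng.normal(size=(l, d))                # sample matrix: l examples, d features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=l)  # noisy target values

def mse(w, X, y):
    # Mean squared error: (1/l) * ||Xw - y||^2.
    return np.mean((X @ w - y) ** 2)

# Analytical solution of the normal equations X^T X w = X^T y.
# Solving the system is cheaper and more stable than inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("recovered weights:", w_hat)
print("train MSE:", mse(w_hat, X, y))
```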