In this video, I'm gonna talk about three different types of machine learning: supervised learning, reinforcement learning and unsupervised learning. Broadly speaking, the first half of the course will be about supervised learning. The second half of the course will be mainly about unsupervised learning, and reinforcement learning will not be covered in the course, because we can't cover everything.

Learning can be divided into three broad groups of algorithms. In supervised learning, you're trying to predict an output when given an input vector, so it's very clear what the point of supervised learning is. In reinforcement learning, you're trying to select actions or sequences of actions to maximize the rewards you get, and the rewards may only occur occasionally. In unsupervised learning, you're trying to discover a good internal representation of the input, and we'll come later to what that might mean.

Supervised learning itself comes in two different flavors. In regression, the target output is a real number or a whole vector of real numbers, such as the price of a stock in six months' time, or the temperature at noon tomorrow. The aim is to get as close as you can to the correct real number. In classification, the target output is a class label. The simplest case is a choice between one and zero, between positive and negative cases. But obviously we can have multiple alternative labels, as when we're classifying handwritten digits.

Supervised learning works by initially selecting a model class, that is, a whole set of models that we're prepared to consider as candidates. You can think of a model class as a function that takes an input vector and some parameters and gives you an output y. So a model class is simply a way of mapping an input to an output using some numerical parameters W, and then we adjust these numerical parameters to make the mapping fit the supervised training data. What we mean by fit is minimizing the discrepancy between the target output on each training case and the actual output produced by the machine learning system. An obvious measure of that discrepancy, if we're using real values as outputs, is the squared difference between the output from our system, y, and the correct output, t, and we put in a factor of one-half so that it cancels the two when we differentiate.
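To make that concrete, here is a minimal sketch of the fitting procedure just described, assuming a linear model class y = W x (the lecture leaves the model class unspecified, so the linear form, the learning rate, and the function names are my own illustration). The discrepancy measure is E = (1/2) * sum((y - t)^2), and the one-half cancels the two when we differentiate, leaving y - t:

    import numpy as np

    # A model class: a mapping from input x to output y via parameters W.
    # The linear form y = W @ x is an assumption made for illustration.
    def predict(W, x):
        return W @ x

    # Discrepancy on one training case: E = 1/2 * sum((y - t)^2).
    def error(W, x, t):
        y = predict(W, x)
        return 0.5 * np.sum((y - t) ** 2)

    # The one-half cancels the two on differentiation: dE/dW = (y - t) x^T.
    def gradient(W, x, t):
        y = predict(W, x)
        return np.outer(y - t, x)

    # Adjust the parameters W to make the mapping fit the training data.
    def fit(W, cases, lr=0.01, steps=100):
        for _ in range(steps):
            for x, t in cases:
                W = W - lr * gradient(W, x, t)
        return W

Calling fit with a list of (x, t) training cases is one way of doing the adjustment described above: it repeatedly nudges W to reduce the discrepancy between target and actual outputs.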
For classification you could use that measure, but there are other, more sensible measures which we'll come to later, and these more sensible measures typically work better as well.

In reinforcement learning, the output is an actual sequence of actions, and you have to decide on those actions based on occasional rewards. The goal in selecting each action is to maximize the expected sum of the future rewards, and we typically use a discount factor so that you don't have to look too far into the future. We say that rewards far in the future don't count for as much as rewards that you get fairly quickly.
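A minimal sketch of that discounted sum (my own illustration; the function name and the value of the discount factor gamma are assumptions, not from the lecture):

    def discounted_return(rewards, gamma=0.9):
        # Expected sum of future rewards, where a reward k steps away is
        # scaled by gamma**k, so later rewards count for less:
        # R = r0 + gamma * r1 + gamma**2 * r2 + ...
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

With gamma = 0.9, a reward arriving 20 steps away is scaled by 0.9**20, which is about 0.12, so the agent doesn't have to look very far into the future.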
Reinforcement learning is difficult. It's difficult because the rewards are typically delayed, so it's hard to know exactly which action was the wrong one in a long sequence of actions. It's also difficult because a scalar reward, especially one that only occurs occasionally, does not supply much information on which to base the changes in parameters. So typically you can't learn millions of parameters using reinforcement learning, whereas with supervised learning and unsupervised learning, you can. Typically, in reinforcement learning, you're trying to learn dozens of parameters, or maybe 1,000 parameters, but not millions. In this course, we can't cover everything, and so we're not going to cover reinforcement learning, even though it's an important topic.

Unsupervised learning is going to be covered in the second half of the course. For about 40 years, the machine learning community basically ignored unsupervised learning, except for one very limited form called clustering. In fact, they used definitions of machine learning that excluded it. Some textbooks defined machine learning as mapping from inputs to outputs, and many researchers thought that clustering was the only form of unsupervised learning. One reason for this is that it's hard to say what the aim of unsupervised learning is.

One major aim is to get an internal representation of the input that is useful for subsequent supervised or reinforcement learning. The reason we might want to do that in two stages is that we don't want to use, for example, the payoffs from reinforcement learning in order to set the parameters for our visual system. So you can compute the distance to a surface by using the disparity between the images you get in your two eyes. But you don't want to learn to do that computation of distance by repeatedly stubbing your toe and adjusting the parameters in your visual system every time you stub your toe. That would involve stubbing your toe a very large number of times, and there are much better ways to learn to fuse two images based purely on the information in the inputs.

Other goals for unsupervised learning are to provide compact, low-dimensional representations of the input. High-dimensional inputs like images typically live on or near a low-dimensional manifold, or several such manifolds in the case of the handwritten digits. What that means is, even if you have a million pixels, there aren't really a million degrees of freedom in what can happen. There may only be a few hundred degrees of freedom in what can happen. So what we want to do is move from a million pixels to a representation of those few hundred degrees of freedom, which amounts to saying where we are on a manifold. We also need to know which manifold we're on. A very limited form of this is principal components analysis, which is linear: it assumes that there's one manifold, and the manifold is a plane in the high-dimensional space.

Another goal for unsupervised learning is to provide an economical representation of the input in terms of learned features. If, for example, we can represent the input in terms of binary features, that's typically economical, because then it takes only one bit to say the state of a binary feature. Alternatively, we could use a large number of real-valued features but insist that for each input almost all of those features are exactly zero. In that case, for each input, we only need to represent a few real numbers, and that's economical.

As I mentioned before, another goal of unsupervised learning is to find clusters in the input. Clustering can be viewed as a very sparse code: we have one feature per cluster, and we insist that all the features except one are zero and that one feature has a value of one. So clustering is really just an extreme case of finding sparse features.
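To make that last point concrete, here is a minimal sketch of a cluster assignment written as a sparse code (my own illustration; the nearest-center assignment rule, the function name, and the example centers are assumptions, not from the lecture):

    import numpy as np

    def one_hot_cluster_code(x, centers):
        # One feature per cluster: every feature is zero except the one
        # for the nearest cluster center, which is set to one.
        distances = np.linalg.norm(centers - x, axis=1)
        code = np.zeros(len(centers))
        code[np.argmin(distances)] = 1.0
        return code

    # For example, with three cluster centers in two dimensions:
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    print(one_hot_cluster_code(np.array([4.8, 5.1]), centers))  # [0. 1. 0.]

The code is sparse in exactly the sense described above: representing an input takes a single active feature out of as many features as there are clusters.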