1 00:00:00,000 --> 00:00:02,772 [MUSIC] 2 00:00:02,772 --> 00:00:05,650 So that defines a gradient. 3 00:00:05,650 --> 00:00:08,250 But one thing that's gonna be useful 4 00:00:08,250 --> 00:00:13,260 is just a different way to visualize the surfaces that we're optimizing over. 5 00:00:13,260 --> 00:00:17,850 So instead of looking at these 3D mesh plots that we've been looking at, we can 6 00:00:17,850 --> 00:00:21,932 look at a contour plot, where we can kind of think of this as a bird's eye view. 7 00:00:21,932 --> 00:00:28,270 [SOUND] It's 8 00:00:28,270 --> 00:00:33,133 a bird's eye view of this function that we're examining, 9 00:00:33,133 --> 00:00:39,660 where here we've now taken this 3D mesh and just transformed it to a 2D plane. 10 00:00:39,660 --> 00:00:45,057 So we have, just in case you can't see it in the slides, this is w0 here. 11 00:00:45,057 --> 00:00:50,660 And this is w1, that's a very big zero. 12 00:00:50,660 --> 00:00:55,020 Okay, and what each one of these curves is. 13 00:00:55,020 --> 00:00:58,255 Again, I'm gonna switch colors to red so it's more visible. 14 00:01:03,433 --> 00:01:05,964 So this curve, for example, 15 00:01:11,513 --> 00:01:16,040 Represents just taking this mesh on the left hand side and slicing it. 16 00:01:16,040 --> 00:01:20,392 So it's just a slice through the 3D function. 17 00:01:20,392 --> 00:01:21,220 So. 18 00:01:26,027 --> 00:01:28,047 Sorry, the pen is really misbehaving. 19 00:01:35,848 --> 00:01:41,171 A slice of the 3D surface where 20 00:01:41,171 --> 00:01:46,085 all values here have the same 21 00:01:46,085 --> 00:01:50,690 value of this function. 22 00:01:50,690 --> 00:01:56,000 Sorry, our functions are called g(w0, w1). 
23 00:01:56,000 --> 00:01:59,870 So let me just step up and say what I'm trying to say here, 24 00:01:59,870 --> 00:02:05,030 which is every w0, w1 pair along 25 00:02:07,190 --> 00:02:12,910 this ellipse here has the same value of the function g 26 00:02:12,910 --> 00:02:18,370 because it was just a flat slice through that 3D contour that we're looking at. 27 00:02:19,440 --> 00:02:24,010 Okay, so each of these rings have, in this case, they are increasing 28 00:02:24,010 --> 00:02:28,640 values of the function as we go from blue, blue means a low value of the function, 29 00:02:28,640 --> 00:02:31,470 all the way out to red, that means a high value of the function. 30 00:02:32,690 --> 00:02:34,440 We're looking at different slices. 31 00:02:34,440 --> 00:02:37,960 We're slicing the function at different values, and 32 00:02:37,960 --> 00:02:40,540 that creates these different contours. 33 00:02:40,540 --> 00:02:45,320 Okay, so this is what's called a contour plot, and it's useful because 34 00:02:46,740 --> 00:02:50,850 it's easier to work with 2D when we're on a 2D surface here. 35 00:02:50,850 --> 00:02:53,640 So drawing things will be easier with this representation. 36 00:02:55,010 --> 00:02:58,300 Okay, so that was just a little detour into contour plots. 37 00:02:58,300 --> 00:03:03,030 So that I can talk about the gradient descent algorithm, which is the analogous 38 00:03:03,030 --> 00:03:09,160 algorithm to what I called the hill descent algorithm in 1D. 39 00:03:09,160 --> 00:03:12,840 But, in place of the derivative of the function, 40 00:03:12,840 --> 00:03:16,210 we've now specified the gradient of the function. 41 00:03:16,210 --> 00:03:19,450 And other than that, everything looks exactly the same. 42 00:03:19,450 --> 00:03:23,010 So what we're doing, is we're taking 43 00:03:23,010 --> 00:03:27,950 we now have a vector of parameters, and we're updating them all at once. 
44 00:03:27,950 --> 00:03:30,929 We're taking our previous vector and 45 00:03:30,929 --> 00:03:36,050 we're updating 46 00:03:36,050 --> 00:03:41,860 with some eta times our gradient, which was also a vector. 47 00:03:41,860 --> 00:03:47,370 So, it's just the vector analog of the hill descent algorithm. 48 00:03:47,370 --> 00:03:51,731 But, if I wanna show this a little bit in pictures here. 49 00:03:54,332 --> 00:03:58,120 Again switching back to red because it'll be easier to see on this plot. 50 00:03:58,120 --> 00:04:01,170 Well, if I'm out here at a point, 51 00:04:01,170 --> 00:04:05,770 the gradient is actually pointing in the direction of steepest ascent. 52 00:04:05,770 --> 00:04:07,380 So that's uphill. 53 00:04:07,380 --> 00:04:08,430 It's pointing this way. 54 00:04:10,850 --> 00:04:13,290 But we're moving in the negative gradient direction. 55 00:04:13,290 --> 00:04:18,390 So let me specify that this thing here is our gradient, 56 00:04:21,970 --> 00:04:27,680 gradient direction, but 57 00:04:27,680 --> 00:04:32,590 then our steps are gonna be in the opposite direction. 58 00:04:32,590 --> 00:04:36,790 So let me actually draw, sorry to take up a little time here, but 59 00:04:36,790 --> 00:04:38,988 I think it's worthwhile for clarity. 60 00:04:38,988 --> 00:04:41,820 Let me just happen to draw the gradient so that it's a purple vector so 61 00:04:41,820 --> 00:04:45,360 it's different from the vectors I'm going to be drawing right now. 62 00:04:47,570 --> 00:04:48,680 Okay. 63 00:04:48,680 --> 00:04:51,440 Cuz the other vectors that I'm gonna be drawing right now 64 00:04:51,440 --> 00:04:53,910 are the steps of my gradient descent algorithm. 65 00:04:53,910 --> 00:04:59,408 So the actual steps I'm taking 66 00:04:59,408 --> 00:05:04,062 are gonna be moving here, 67 00:05:07,336 --> 00:05:10,000 towards this optimal value. 
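Written out, the update just described can be sketched as below. The superscript iteration index (t) and the bold vector notation are assumptions added for illustration; the lecture itself only names the parameters w0 and w1, the step size eta, the function g, and its gradient.

```latex
\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \, \nabla g\big(\mathbf{w}^{(t)}\big),
\qquad
\nabla g(\mathbf{w}) =
\begin{bmatrix}
\partial g / \partial w_0 \\
\partial g / \partial w_1
\end{bmatrix}
```

The minus sign is what turns the gradient, which points uphill, into a step in the downhill direction.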
68 00:05:13,280 --> 00:05:18,490 So, it's exactly like what we saw in the 1D case, 69 00:05:18,490 --> 00:05:21,249 but now we're moving it in a 2D space. 70 00:05:22,270 --> 00:05:27,950 Or really any dimensional space, but what I'm drawing is just a 2D space. 71 00:05:27,950 --> 00:05:33,240 And in terms of assessing convergence in this case, well, in place of looking at 72 00:05:34,600 --> 00:05:38,140 the absolute value of the derivative, we're going to look at 73 00:05:40,290 --> 00:05:43,100 the magnitude of the gradient. 74 00:05:45,400 --> 00:05:52,212 And when the magnitude of the gradient is less than some epsilon that we're fixing, 75 00:05:52,212 --> 00:05:56,511 we're gonna say that the algorithm has converged. 76 00:05:56,511 --> 00:06:00,299 >> [MUSIC]
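As a rough illustration of the procedure described in this segment, here is a minimal Python sketch of gradient descent with the gradient-magnitude stopping rule. The specific function g (a simple elliptical bowl), the starting point, and the values of eta, epsilon, and max_iter are all assumptions chosen for the example; none of them come from the lecture.

```python
import math


def g(w0, w1):
    # Hypothetical objective: an elliptical bowl with its minimum at (1, -2).
    return (w0 - 1.0) ** 2 + 2.0 * (w1 + 2.0) ** 2


def grad_g(w0, w1):
    # Gradient of the hypothetical g: the vector of partial derivatives.
    return (2.0 * (w0 - 1.0), 4.0 * (w1 + 2.0))


def gradient_descent(grad, w_init, eta=0.1, epsilon=1e-6, max_iter=10_000):
    """Step against the gradient until its magnitude drops below epsilon."""
    w = tuple(w_init)
    for iteration in range(max_iter):
        gradient = grad(*w)
        # Convergence check: magnitude (Euclidean norm) of the gradient.
        if math.hypot(*gradient) < epsilon:
            return w, iteration
        # Update every parameter at once: w <- w - eta * gradient.
        w = tuple(wi - eta * gi for wi, gi in zip(w, gradient))
    return w, max_iter


w_star, n_iter = gradient_descent(grad_g, (4.0, 3.0))
print("estimated minimizer:", w_star)
print("iterations used:", n_iter)
```

With a fixed step size like this, convergence depends on eta being small enough for the particular function; a value that works for this bowl may overshoot on a steeper one.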