[MUSIC] So in these algorithms, I said that there's a stepsize. And you might be wondering, well, how do you choose that stepsize? So, remember, the stepsize we denoted as eta. And there are lots of different choices you can make here, and clearly they're very important. This determines how much you're changing your w at every iteration of this algorithm.

One choice you can look at is something called a fixed stepsize or constant stepsize, where, as the name implies, you just set eta equal to some constant, so for example, maybe 0.1. But what can happen is that you can jump around quite a lot. So let's say I start far away. There's a really big gradient. I take a very big step, and I keep taking these big steps. And then here, I end up jumping over the optimum to a point here, and then I jump back, and then I'm going back and forth, and back and forth, and I converge very slowly to the optimum itself. And so an analogy here is you can think of a snowboarder in a half-pipe, where they just keep ramping up, and back, and up, and back, and slowly it's decaying, and slowly they're gonna settle into the groove of that half-pipe. But you can imagine going back and forth, and back and forth, quite a number of times.

But I should mention that this fixed stepsize setting will work for a lot of the cases we're going to look at in this module, because in most of the cases these functions are going to be something that's called strongly convex. That's just a really well-behaved convex function. It's very clearly convex; it just never flattens out very much. And in those settings, you will converge using this fixed stepsize, it just might take a little while.

On the other hand, a common choice is to decrease the stepsize as the number of iterations increases. Okay? So this means that there's some stepsize schedule, as it's often called, that you need to set. And common choices for this are that the stepsize you're using at iteration t, eta_t, is equal to some alpha / t, or another common choice is some alpha / sqrt(t). And let's just plot what these functions look like.
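As a rough illustration of these choices, here is a minimal Python sketch of a constant stepsize next to the two decreasing schedules, eta_t = alpha / t and eta_t = alpha / sqrt(t). The function names and the constants (0.1 for the fixed stepsize, alpha = 1.0) are illustrative assumptions, not values fixed by the lecture.

```python
import numpy as np

def eta_fixed(t, eta0=0.1):
    """Fixed (constant) stepsize: the same eta at every iteration t."""
    return eta0

def eta_over_t(t, alpha=1.0):
    """Decreasing schedule eta_t = alpha / t."""
    return alpha / t

def eta_over_sqrt_t(t, alpha=1.0):
    """Decreasing schedule eta_t = alpha / sqrt(t)."""
    return alpha / np.sqrt(t)

# Both decreasing schedules shrink toward 0 as the iteration count grows,
# e.g. eta_over_t(1) = 1.0 while eta_over_t(100) = 0.01.
```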
Both of these functions look something like this: this would be the plot of eta_t over iterations t, and the stepsize is decreasing with the number of iterations. So why does this make sense? Well, when I start out, I'm typically assuming that I'm gonna be far, or decently far, from the optimum, so I want to take large steps. But as I'm running my algorithm, hoping I'm getting closer, I want to hone in more rapidly on the optimum. So that's why I start decreasing the stepsize. So that when I'm out here, initially maybe I still take a really large jump, but then as I'm going in, maybe I'll jump across once, but I'm gonna hone in much more rapidly on this minimum here.

But one thing you have to be careful about is not decreasing the stepsize too rapidly, because if you do that, you're, again, gonna take a while to converge, because you're just gonna be taking very, very, very small steps. Okay, so in summary, choosing your stepsize is a bit of an art, but here we've gone through a couple of examples of things that are commonly used out there.

Okay, so the other part of the algorithm that we left unspecified is the part where we say: while not converged. How are we going to assess our convergence? Well, we know that the optimum occurs when the derivative of the function is equal to 0. But what we're gonna see in practice is that the derivative is gonna get smaller and smaller and smaller; it's gonna get very, very, very small, but it won't be exactly 0. At some point, we're going to want to say, okay, that's good enough, we're close enough to the optimum, I'm gonna terminate this algorithm, and I'm gonna say that this is my solution. So what we're gonna need to specify is a condition where we check the absolute value of the derivative, because I don't care if I'm a little bit to the right or a little bit to the left of the optimum, just what the absolute value is. If that is less than some epsilon, which is a threshold I'm setting, then I'm gonna terminate the algorithm and return whatever solution I have, w_t.
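To make the stopping rule concrete, here is a small sketch of a gradient descent loop that terminates once the magnitude of the derivative drops below epsilon. The function signature, the max_iter safeguard, the default epsilon, and the example problem are assumptions for illustration, not part of the lecture's algorithm.

```python
import numpy as np

def gradient_descent(gradient, w0, eta_schedule, epsilon=1e-6, max_iter=100000):
    """Minimize a function given its gradient, stopping when |gradient| < epsilon.

    gradient(w)     -- returns the derivative (scalar) or gradient (vector) at w
    eta_schedule(t) -- returns the stepsize eta_t to use at iteration t
    """
    w = np.asarray(w0, dtype=float)
    for t in range(1, max_iter + 1):
        g = gradient(w)
        # Convergence test: the derivative is "close enough" to 0.
        if np.linalg.norm(g) < epsilon:
            break
        # Gradient descent update: step against the gradient, scaled by eta_t.
        w = w - eta_schedule(t) * g
    return w

# Example: minimize f(w) = (w - 3)^2, whose derivative is 2 * (w - 3),
# using a fixed stepsize of 0.1.
w_hat = gradient_descent(lambda w: 2 * (w - 3), w0=10.0, eta_schedule=lambda t: 0.1)
```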
So in practice, we're just gonna choose epsilon to be very small. And I wanna emphasize that what "very small" means depends on the data that you're looking at, what the form of this function is, and what range of gradients we might expect. Another thing you can look at is a plot of the value of the function over iterations. You'll tend to see that the value decreases and then basically stops changing very much, and at that point, you know that you've converged. [MUSIC]
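One practical way to do that last check, sketched here under assumed names (a cost function handle and a tol threshold on the change in the objective), is to record the objective value at every iteration and stop once the trace is essentially flat; plotting the recorded values against the iteration number gives the kind of curve described above.

```python
def descend_with_cost_trace(cost, gradient, w0, eta_schedule, tol=1e-8, max_iter=100000):
    """Gradient descent that records the objective value each iteration and
    stops once successive values barely change (an alternative practical test)."""
    w = w0
    costs = [cost(w)]
    for t in range(1, max_iter + 1):
        w = w - eta_schedule(t) * gradient(w)
        costs.append(cost(w))
        # Stop when the objective is essentially no longer decreasing.
        if abs(costs[-2] - costs[-1]) < tol:
            break
    return w, costs  # plot `costs` vs. iteration number to eyeball convergence
```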