[MUSIC] You've heard me hint about this a lot today, and the practical issues of stochastic gradient are pretty significant. So for most of the rest of the module, we'll talk about how to address some of those practical issues.

Let's take a moment to review the stochastic gradient algorithm. We initialize our parameters, our coefficients, to some value, let's say all 0. And then, until convergence, we go one data point at a time, and one feature at a time, and we update the coefficient of that feature by computing the gradient at just that single data point.

So we use the contribution of each data point one at a time. We're scanning through the data, updating the parameters one data point at a time. For example, I see my first data point here: 0 "awesome", 2 "awful", sentiment -1. I make an update which pushes me towards predicting -1 for this data point. Now if the next data point is also negative, I'm going to make another kind of negative push. If the third one is negative, I make another negative push, and so on. And so one worry, one bad thing that can happen with stochastic gradient, is if your data is implicitly sorted in a certain way. So, for example, all the negative data points are coming before all the positives.
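The update loop described above can be sketched as follows, here for logistic regression with labels in {-1, +1}. This is a minimal sketch, not the course's exact code: the function name, step size, and the fixed number of passes (standing in for "until convergence") are my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, step_size=0.1, max_passes=10):
    """Stochastic gradient for logistic regression.
    X: (n_points, n_features) feature matrix; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])            # initialize all coefficients to 0
    for _ in range(max_passes):         # "until convergence" (fixed passes here)
        for i in range(X.shape[0]):     # one data point at a time
            p = sigmoid(X[i] @ w)       # P(y = +1 | x_i, w)
            indicator = 1.0 if y[i] == +1 else 0.0
            for j in range(X.shape[1]): # one feature at a time
                # gradient contribution of this single data point
                w[j] += step_size * X[i, j] * (indicator - p)
    return w
```

For the example data point (0 "awesome", 2 "awful", sentiment -1), a single update starting from w = 0 decreases the coefficient on "awful", pushing the score for that point negative, i.e., towards predicting -1.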
That can introduce bad behavior into the algorithm, bad practical performance. We worry about this a lot, and it's really significant. So, because of that, before you start running stochastic gradient you should always shuffle the rows of the data, mix them up, so that you don't have these long regions of, say, negatives before positives, or young people before older people, or people who live in one country versus people who live in another country. You want to mix it all up.

What that means, in the context of the stochastic gradient algorithm we just saw, is just adding a line at the beginning where you shuffle the data. So before doing anything, you should start by shuffling the data.
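The shuffling step above amounts to permuting the rows once, before the update loop runs. A minimal sketch (the helper name and seeded generator are my choices, not from the lecture); the key detail is using one permutation for both arrays so each label stays paired with its feature row:

```python
import numpy as np

def shuffle_rows(X, y, seed=0):
    """Shuffle the rows of the data before running stochastic gradient,
    keeping each label paired with its feature row."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))   # one random ordering, applied to both
    return X[perm], y[perm]
```

You would call this once at the start, then run the stochastic gradient loop over the shuffled rows.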