So the last thing I wanted to talk about in today's lecture is prefetching, or the technique of prefetching.

So what is prefetching? Well, you speculate that you're going to be using some, let's say, instruction or data in the near future. So you can say: I think I'm going to be accessing some cache line in the near future. Now, you haven't accessed it yet; the processor hasn't initiated that transaction yet. But because you have some reason to believe it's a good idea, you speculatively go and fetch that data from either memory or a farther-out level of cache, and bring it into a nearer level of cache. So this is the basic idea of prefetching.

Well, what does this look like? There are a couple of different varieties of prefetching we're going to be talking about today. We're going to talk about hardware prefetching, which is a completely hardware-based solution where the hardware tries to detect what is a useful thing to go bring in. We're going to talk about software prefetching, where the compiler or the programmer decides what is a useful piece of data to go get early. And we can talk about mixed schemes, where software hints to hardware, potentially, about when it's a good idea to go bring in data.

Okay, so an important question comes up on the slide here: what types of cache misses does prefetching affect?

Hm? What does it affect? Well, we're bringing in data potentially before it's actually used. So that sounds good; this all sounds good. But we're also potentially polluting our cache with data that we don't yet know we're going to need. So there can be downsides to this. So let's go through the three Cs.

Capacity: does this affect capacity? Well, yes, it actually affects capacity, in what is effectively a negative way. You've effectively polluted your cache, because you've prefetched something into your cache, and you could potentially make the cache have a smaller effective capacity because of this.

And this actually trades off; it's a similar sort of thing with conflict misses.
You're effectively putting data into your cache. If you have a really bad prefetcher, a prefetcher which goes and brings in data that is never accessed, we'll say, or not accessed in a short period of time afterwards, you may actually be both polluting capacity and creating more conflicts. You bring it into a line, and because of this, some other piece of data that gets brought in has to go evict it now. And it takes time to go evict it, or the cache at least has to make some decision, and it might even mess up the least recently used policy, the replacement policy.

Okay, so why do we do this? We said this is bad for conflict misses, or increases our conflict misses. Where does prefetching help? Well, it can actually really help with compulsory misses. So, misses where you just miss on effectively cold data. The data's just not in your cache; you have to go get it and bring it in. So let's say first-touch data, a compulsory miss. It would be really great if the prefetch system, either a hardware or a software prefetcher, could bring that data into your cache preemptively, and by bringing that data in preemptively, either reduce the latency or remove the latency of compulsory misses completely.

So how do we reduce that miss cost? Well, let's say you go to take a cache miss on a line, and the prefetcher has already sort of started to go get that line from main memory. The miss status handling registers can detect that there's already a request in flight for that line, not actually send out a subsequent request, and just wait for that data to come back. And this can decrease the latency of a compulsory miss, or decrease the miss cost of a compulsory miss.

Alternatively, the compulsory miss can just go away. If the prefetcher is really on its game, it can bring the data into your cache, and by the time you go to access that data, it's clairvoyant, if you will: it already knew the data you were going to access in the future, and it's already pulled it in. When you go to access it, it's already there.

So prefetching can have some pretty big benefits there.
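To make that miss-merging idea concrete, here is a minimal sketch in C of a demand miss merging with an in-flight prefetch in the miss status handling registers. The structure layout, the table size, and the helper function are assumptions for illustration, not any particular machine's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS 8    /* made-up number of outstanding misses */

    /* Hypothetical miss status handling register: one per request
       outstanding to the next level of the memory hierarchy. */
    typedef struct {
        bool     valid;
        uint64_t block_addr;
        bool     is_prefetch;   /* launched speculatively? */
    } mshr_t;

    static mshr_t mshrs[NUM_MSHRS];

    extern void send_request_to_memory(uint64_t block);  /* assumed hook */

    /* A demand miss arrives for 'block'. If a prefetch of the same line
       is already in flight, merge with it rather than issuing a second
       request; the miss then only pays the prefetch's remaining latency. */
    void handle_demand_miss(uint64_t block)
    {
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].block_addr == block) {
                mshrs[i].is_prefetch = false;  /* upgrade to demand */
                return;                        /* merged: no new request */
            }
        }
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (!mshrs[i].valid) {
                mshrs[i] = (mshr_t){ true, block, false };
                send_request_to_memory(block);
                return;
            }
        }
        /* All MSHRs busy: a real design would stall the pipeline here. */
    }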
But like all things, including caches in general, you need to make sure that what you bring in is used with high probability. Otherwise, you're just going to pollute your cache.

One last thing I wanted to say, and we're going to see this: instruction caches are usually easier to predict for prefetching purposes than data caches. Why is this? Well, instructions are very well behaved. Typically you execute one instruction, then you execute the next instruction, then you execute the next instruction after it. And yes, there are branches, and programs do have a fair number of branches. But there's a decent chance you're just going to be executing straight-line code, especially in something like a loop, where you execute straight-line code for a while, hit one branch at the bottom, and then go back up to the top. So your prefetcher can many times be very effective for that.

Data, on the other hand: usually data misses and data accesses are a little bit more randomized. But not always. Sometimes you're just going to be accessing, let's say, straight-line data, something like a memcpy, or something like taking an array and reading the whole array, and a prefetcher can be very effective there also.

So let's take a look at some of the issues with prefetching. We've already touched on some of these things. In order for a prefetcher to be good, it has to be useful. So it should produce hits in the cache. If the prefetcher is actually producing mostly misses in the cache, or bringing in stuff that, on the instruction side, never gets executed, or, on the data side, never gets touched by a load or a store, it wasn't useful.

Now, things get a little bit more complex here when you start to think about usefulness. You have to start thinking about timeliness of the data. So what do we mean by timeliness? Well, if you prefetch a piece of data into your cache, but you do it too early, it's very possible that by the time you go to read the data, it might have gotten evicted from the cache.
So if you bring it in too early, the prefetcher might have just wasted bandwidth and wasted energy going to the next level of cache or out to main memory, and that data may not end up being useful because it wasn't timely. You might have gotten the address right but gotten it there too early. Or the prefetcher might figure it out too late. So how does this happen? Let's say you determine that this one line is really good and there's a high probability you're going to need it, so you decide you want to go pull it in. But the prefetcher makes that decision the cycle after the load to that line happens. Well, that didn't help. Maybe even the cycle before, it just wouldn't help, because you're too late, effectively.

And as we've already discussed, you can have significant cache and bandwidth pollution here. So you can pollute the capacity of the cache and you can waste a lot of bandwidth. And this is a problem if, for instance, you're on a multicore chip or many-core chip, and off-chip memory bandwidth is really important and a really limited resource; you may not actually want to have any prefetch turned on, because you might just be wasting bandwidth. So it's a speculative-execution sort of thing: you're trying to pull in data, but it's speculative. You don't know 100% that you're going to be using that data, and any time you bring something in that you don't use, you've effectively wasted bandwidth. So this is a big challenge here.

From the bandwidth perspective, usually a good place to put a prefetcher is somewhere between, let's say, a level-two cache and a level-one cache, especially maybe on the instruction side, with the prefetcher basically pulling data into the level one. And that bandwidth from your level one to your level two is relatively inexpensive, because it's an on-chip bus, directly connected; you can make it relatively large. Coming out of the level-two cache, though, or going out of, let's say, the last-level cache to main memory, that's usually expensive bandwidth. So sometimes people come up with prefetchers that only prefetch from a level-two cache to a level-one cache.
But if it's not in the level-two cache, the prefetch just gets dropped. So that's a pretty common strategy here.

Let's start off by looking at the simple case here of instruction prefetching in hardware, with a basic example: something like what they had in the Alpha 21064. So it's a relatively basic processor, and we want to look at what they did for performance.

They actually added an extra little buffer here that they're going to call a stream buffer. This stream buffer is going to store the next line that is likely to be used. So if you take a cache miss, this is on the instruction side, for a particular instruction cache block, let's say block i in this example, the hardware decides that with high probability block i + 1 is probably going to be used. But instead of polluting the instruction cache, in their design the subsequent cache block is actually stored into the stream buffer. So you don't actually pollute the cache. This is kind of like adding a little bit of extra associativity, only for the next predicted line, if you will.

Okay, so in this example here, what would happen is, you'd be executing out of the level-one cache. And let's say the code just falls through: there's no branch at the end of the cache line, and you fall through. Well, the predictor actually did really well here, and it effectively overlapped the cost of going out to the level-two cache or main memory for that subsequent cache line. So it hits in the stream buffer, the stream buffer entry gets moved into the cache, and the hardware prefetch system says: aha, I'm now executing block i + 1; maybe it's a good idea to go and prefetch block i + 2.

And effectively what you're doing here is overlapping the execution of block i with the fetch of block i + 1. So you can be doing two things at one time. You can have the CPU executing, excuse me, fetching from block i, and you can be using the cache, moving data into the stream buffer for block i + 1. And likewise, when you're executing, let's say, block i + 1, you might be fetching block i + 2. Okay, so that's the instruction prefetch side of the world.
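To pin down the mechanics, here's a minimal sketch in C of a one-entry stream buffer along the lines just described. The helper functions and the 64-byte line size are assumptions for illustration; this is not the 21064's actual logic.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical one-entry stream buffer sitting beside the I-cache. */
    typedef struct {
        bool     valid;
        uint32_t block_addr;     /* which instruction block it holds */
        uint8_t  data[64];       /* one cache line of instructions */
    } stream_buffer_t;

    static stream_buffer_t sb;

    /* Assumed stand-ins for the real cache and memory machinery. */
    extern bool icache_lookup(uint32_t block, uint8_t *line);
    extern void icache_fill(uint32_t block, const uint8_t *line);
    extern void memory_fetch(uint32_t block, uint8_t *line);

    /* Fetch instruction block i, consulting the stream buffer on a miss. */
    void ifetch_block(uint32_t i, uint8_t *line)
    {
        if (icache_lookup(i, line))
            return;                          /* ordinary I-cache hit */

        if (sb.valid && sb.block_addr == i) {
            /* Stream-buffer hit: move the line into the cache. */
            icache_fill(i, sb.data);
            memcpy(line, sb.data, sizeof sb.data);
        } else {
            /* True miss: go get block i itself. */
            memory_fetch(i, line);
            icache_fill(i, line);
        }

        /* Either way, request block i + 1 into the stream buffer, not
           the cache; in hardware this proceeds in the background,
           overlapped with the execution of block i. */
        sb.block_addr = i + 1;
        memory_fetch(sb.block_addr, sb.data);
        sb.valid = true;
    }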
Life gets a little more complex when we start to look at data prefetching. Now, why is this? Well, data is not as well structured as instructions. With instructions, typically the blocks of code just keep falling through. In the data world, you don't necessarily know, just because you've gone in and accessed, let's say, address a, that you will go access some other particular address because of that.

But you can try. So people have implemented a bunch of different prefetchers that help to some extent. The most basic thing to do is actually something very similar to what we had in the instruction example. If you go and take a cache miss on, let's say, block b, you go also fetch block b + 1 into your cache. And this tends to help. You have to be a little bit careful here, because you might be wasting capacity and bandwidth, but in many cases this is helpful. So if you're streaming through an array, this helps. If you're doing a memcpy, this helps. This is prefetch-on-miss.

Note that we say prefetch-on-miss here, not prefetch-on-hit, which is a little bit different from our example in the instruction space there. With prefetch-on-miss, it's only on a cache miss that you go, let's say, fetch b + 1. Now, there are different strategies here, prefetch on hit and prefetch on miss, and you might think about doing either of these depending on the heuristics your hardware comes up with.

So that's prefetch-on-miss. Maybe you actually want to do one-block lookahead instead. This is basically the same thing we were doing on the instruction side: if you go access, let's say, block b, and this means any access, not necessarily a cache miss, you go and actually fetch b + 1 from your next level of cache. A small sketch of these two policies follows.
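Here's a minimal sketch in C contrasting the two policies; the cache-interface functions are assumed stand-ins, not a real simulator's API.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed hooks into the data cache and the next level down. */
    extern bool dcache_lookup(uint32_t block);            /* true on hit */
    extern void dcache_fill_from_next_level(uint32_t block);

    /* Prefetch-on-miss: only a demand miss on block b triggers
       the speculative fetch of b + 1. */
    void access_prefetch_on_miss(uint32_t b)
    {
        if (!dcache_lookup(b)) {
            dcache_fill_from_next_level(b);        /* demand fill */
            dcache_fill_from_next_level(b + 1);    /* speculative fill */
        }
    }

    /* One-block lookahead: every access to block b, hit or miss,
       prefetches b + 1 if it isn't already resident. */
    void access_one_block_lookahead(uint32_t b)
    {
        if (!dcache_lookup(b))
            dcache_fill_from_next_level(b);
        if (!dcache_lookup(b + 1))
            dcache_fill_from_next_level(b + 1);
    }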
Hm, okay, so the question that comes up is: is this one-block lookahead the same thing as doubling your block size? Well, that's interesting, because whenever you touch b, you effectively go prefetch b + 1. This kind of looks like just making your block size bigger, because you're always bringing in two blocks. But it's only a prefetch operation, so you could think about having it, for instance, just pull from your L2 into your L1 in this example, and not actually have to go out to your last level of cache. So you could think about having different trade-offs there. Also, the downside of this is that it's not just making your block size bigger; in fact, we have two sets of tags here. That's the downside of having something like this one-block lookahead scheme.

But you can think about taking this example and actually extending it to an N-block lookahead. So if you go to access block, let's say, b, or cache line b, you go fetch a couple after that. And the name of the game here is that you're trying to hide memory latency, main memory latency. So what would be really great is if, when you go to access, let's say, block b, the hardware somehow determines via heuristics (typically how you implement these things is with little watchers that watch what accesses you're doing and then decide it's fruitful to actually go do a prefetch) that you should say: I'm accessing block b, I should start trying to read from main memory blocks b+1, b+2, b+3, b+4, all the way up to, let's say, big N here. And by doing this, you can overlap those fetches with execution, so that when you try to access the next block, it will already have arrived, and you don't have to wait for the trip out to main memory. This can actually increase our performance significantly.

Okay, so those are the basic examples, but they don't always work well. Sometimes you might have something like an array access where you're not touching every byte in the array. Maybe you're accessing every Nth byte in the array, or some offset. Well, how does this happen? Well, this is actually really common if you're trying to access an array of structures, or an array of, let's say, class objects. Why is this? Well, if you have a structure, you have different fields in the structure at different offsets inside the structure. So if you have an array of them, they're packed densely, and if you're reading, let's say, some subfield of the structure while reading every element of that array, you're actually going to have what's called a strided access, as in the small example that follows.
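Here's a small C illustration of why an array of structures produces strided accesses; the struct and its field names are made up for the example.

    /* A made-up 64-byte record, so consecutive elements start one
       typical cache line apart. */
    struct particle {
        double pos[3];    /* bytes  0..23 */
        double vel[3];    /* bytes 24..47 */
        double mass;      /* bytes 48..55 */
        double charge;    /* bytes 56..63 */
    };

    double total_mass(const struct particle *p, int n)
    {
        double sum = 0.0;
        /* Reading only the mass field of every element touches byte
           offsets 48, 112, 176, ...: a strided pattern with a 64-byte
           stride, not unit-stride access through every byte. */
        for (int i = 0; i < n; i++)
            sum += p[i].mass;
        return sum;
    }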
So you're not going to be accessing byte one and then byte two and then byte three, what's called unit stride. Instead, you're going to have some offset as you move through the array. And if the structure is large enough, what might end up happening is, you may not actually want to go and fetch the next line and the next line and the next line. Instead, you might want to fetch just the lines that contain the portion of the structure that you want to go access.

So, what does this look like? Well, you need something like a strided hardware prefetcher. So if you see a pattern happening, typically that's when you fire up your strided prefetcher. So if you go and look at a modern-day Intel processor, they actually have stride detectors. These are little pieces of hardware which watch the cache access pattern, or the processor access pattern, and will say: aha, I see that the processor is accessing cache block b, cache block b + N, cache block b + 2N, cache block b + 3N; maybe we should go prefetch cache block b + 4N, where N is the stride size. So you may not be pulling in every block in a series, but instead pulling in blocks that are spaced out through memory with some spacing. But, as with everything, you have to be careful that this doesn't pollute your cache from a bandwidth or a capacity perspective.

So let's take a look at an example here: the POWER5. The POWER5 has a pretty sophisticated hardware-based prefetching and prediction scheme. They actually have eight independent stride prefetchers per processor, which can each prefetch up to 12 lines ahead of the current access. Wow, so that's pretty impressive. They basically have little pieces of hardware that see what is a likely thing to go access, based on what the program is doing, and try to bring it in early to improve performance, and reduce the cache miss latency, and reduce the cache miss time or rate, if you will.

Okay. That's hardware prefetching; a tiny sketch of such a stride detector follows.
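Here's a minimal sketch in C of the kind of stride detector just described; the confidence threshold and entry layout are made-up parameters, not Intel's or IBM's actual design.

    #include <stdint.h>

    /* One hypothetical stride-detector entry: remembers the last block
       accessed and the stride between the last two accesses. */
    typedef struct {
        uint64_t last_block;
        int64_t  stride;
        int      confidence;   /* consecutive repeats of the same stride */
    } stride_entry_t;

    #define CONF_THRESHOLD 2   /* made-up: two repeats arm the prefetcher */

    extern void prefetch_block(uint64_t block);   /* assumed hook */

    /* Called on each demand access to cache block 'b'. After seeing
       b, b + N, b + 2N, b + 3N, it starts prefetching one stride ahead. */
    void stride_detector_observe(stride_entry_t *e, uint64_t b)
    {
        int64_t s = (int64_t)(b - e->last_block);

        if (s != 0 && s == e->stride) {
            if (++e->confidence >= CONF_THRESHOLD)
                prefetch_block(b + (uint64_t)s);   /* guess b + N next */
        } else {
            e->stride = s;          /* retrain on the new stride */
            e->confidence = 0;
        }
        e->last_block = b;
    }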
We can also think about having software-based prefetching. The compiler can figure out, aha, or maybe the programmer who's writing the code can say: I know this access is going to miss many times. So instead of waiting to take a cache miss when I really, really need the data, which is when I have to go do the add here with, let's say, a[i] and b[i], we can say: okay, I know the next time around the loop I'm going to be accessing a[i+1] and b[i+1], so let's tell the memory system that I'm going to be accessing a[i+1] and b[i+1], and get that operation out there early.

So we can effectively have the software help the hardware here by doing a complete software prefetch. And there are sort of two ways you can think about doing this. One way is that these prefetches are basically just loads that bring the line in early. Alternatively, some architectures have a more hybrid hardware-software approach, where you actually have a prefetch instruction, and the prefetch instruction tells the hardware, or hints to the hardware, that it's likely you're going to be using this data, but maybe you don't want to waste off-chip memory bandwidth for it. So only pull it in, let's say, from the L2 to the level-one cache, but don't go take a full cache miss for it, because it has some probability of not being correct and shouldn't take priority over true cache misses. And if you had just a normal load operation for your prefetch, you wouldn't be able to express that case. So in some hybrid solutions, software prefetches will actually turn into special instructions.

Okay, so what are the issues with software prefetching? Well, timing actually ends up being the biggest challenge here, not predictability. This loop here is very predictable: we know we're just going to stride through array a and array b. What we don't know is whether we'll end up being too late or too early. So you have to get the timing right when you actually go to access these different things. So what we're going to do here is, instead of accessing i+1, we're going to access maybe i+P. And P is a settable parameter, where you can say: this is a few cache lines into the future; we can pull it in. A sketch of the loop is below.
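As a concrete version of the loop being described, here's a minimal sketch in C using GCC/Clang's __builtin_prefetch; the arrays, the loop body, and the value of P are illustrative stand-ins for whatever is on the slide.

    #include <stddef.h>

    #define P 16   /* prefetch distance in elements: a tunable guess */

    /* c[i] = a[i] + b[i], with software prefetches issued P iterations
       ahead so the lines arrive before the demand accesses need them. */
    void vadd(double *c, const double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + P < n) {
                /* Hints only; hardware may drop them. 0 = read access. */
                __builtin_prefetch(&a[i + P], 0);
                __builtin_prefetch(&b[i + P], 0);
            }
            c[i] = a[i] + b[i];
        }
    }

A rough rule of thumb, with made-up numbers: if a miss to main memory costs about 200 cycles and each iteration takes about 4 cycles, you'd want P around 200 / 4 = 50 elements. As discussed next, picking this value well is the hard part.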
And we want to set that correctly, so that we pull the data in at the right time, so that we can cover the cache miss latency out to either the next level of cache or maybe out to main memory, but not so early that we actually pollute the cache and cause conflicts somewhere else. If we set P here huge, we'd potentially be pulling data in way too early. So if we set this to, let's say, bigger than the size of arrays a and b, it's very possible that we would actually be polluting the cache with the wrong data, if you will.

A few other things come up here. How do we go about trying to estimate P? Well, this is a tricky thing. Why is this hard? Well, we don't quite know what the actual miss latency is, and we don't quite know whether a and b are going to hit or miss, or whether the hardware prefetcher is going to fight against our software prefetcher. And things get a little complex here, and it's really just because you don't know the memory load, the actual congestion at the memory controller, or whether particular lines miss or not. So you effectively have a dynamic number which is being determined at static compile time, we'll say. And also, you don't quite know what architecture you're going to be running on. Different architectures, or different microarchitectures for the same instruction set architecture, might have different cache miss latencies, so if you want to write one piece of software that works on all of them, there may not be a single correct value of P.

So you have to think about that. Finally, I wanted to say that you must consider the cost of the prefetch instructions. These are not free; you have to add these two extra instructions here, and if the performance benefit of having the prefetch instructions does not outweigh the cost of adding two extra instructions, it's probably not a good trade-off. You probably shouldn't be doing something like this.

So, next lecture we're going to talk a little bit more about different software prefetching issues, and other ways to rearrange code in the compiler to effectively bring data in in a more structured manner. But let's take a look at the efficacy of prefetching. So which of these things does it help or hurt?
Well, what we're going to see is, first of all, we're going to potentially help the miss penalty. So when is the miss penalty helped? When is the miss penalty made better? Well, if you have a cache, or excuse me, if you have a prefetch in flight to a piece of data that you then have a demand miss for, you effectively cut the latency for that demand miss, the miss penalty, down to just finishing the currently in-flight prefetch. So you can take that cost and make it a little bit shorter.

Miss rate? Miss rate is also better. Why is the miss rate better? Well, if you have a prefetch occur that brought data into your cache, and you then hit on that data, you just have fewer misses out of your cache, so your miss rate is better.

A couple of other things here: hit time and bandwidth. Prefetching, if done poorly, might actually hurt your bandwidth, so there can be a negative there if done poorly. And hit time is typically not affected by prefetching.

Okay, we'll stop here for today.