Okay. So, we talked about vector processors, and we introduced the short-vector idea with single instruction, multiple data instructions. I wanted to talk a little bit about one of the common places you see something that's like a vector processor, but isn't quite a vector processor, and that's in graphics processing units. This is the graphics card in your computer. My laptop here has an ATI graphics chip in it, and I've actually run OpenCL code compiled for it, which is a general-purpose use of a graphics processing unit.

One of the things I wanted to kick this off with is that these architectures look strange from a computer architecture perspective. That really goes back to the fact that they were not designed to be general-purpose computing machines; they were designed to render 3D graphics. The early versions had very fixed pipelines and no programmability, so you couldn't even use them to do general-purpose computation. Then they started to get a little more flexible. An example of this is that the original Nvidia chips had something called pixel shaders. The idea is that, on a per-pixel basis, as you render a three-dimensional picture for a game or something like that, you could apply a little customization to how each pixel gets rendered. The game programmer, for instance, got to write a little program that ran for each pixel as it was rendered to the screen.

What's interesting is that a programming language actually developed around these pixel shaders, and the first implementation of general-purpose computing on these devices was people writing pixel shaders that nominally rendered pixels but actually computed something else. You could try to do a matrix multiplication, and the result would be a picture. [LAUGH] Kind of a strange idea: you take one picture and another picture, run some special shading on them that makes it look sort of like a 3D picture, and all of a sudden the output is actually your result.
Well, the people who made graphics processing units got the bright idea that maybe they could make this a little easier and grow their user base beyond just graphics cards if they encouraged people to write programs on these. So they started to make them more and more general purpose. Instead of only being able to run a pixel shader, working on one pixel at a time and being very pixel-oriented or graphics-oriented, they renamed everything, came up with a programming model, and exposed some of the architecture to make it a little more general purpose, so that you might be able to run other programs on it. That's really what happened. As I said, the first people who tried to use these machines had to program them through the pixel shaders and the per-vertex computations that were baked into the original architectures, which was very hard, because you were basically trying to think about everything as a picture in a frame and then back-compute what was going on. But as we've moved along, we've started to see new languages, so there's new programming support, and the architectures have become more general purpose.

This brings us to general-purpose graphics processing units. We still have "GPU" in there, so it's still special purpose, but we stick "GP" on the front and call these GPGPUs, General-Purpose Graphics Processing Units. A good example of this is that Nvidia came up with a programming language called CUDA. Now, this was not the first foray into this; there were research languages and ideas that predated it. But we'll talk about that in a second. With GPGPUs, the programming model is a little bit strange: it's a threading model, and we're going to talk about that in a little more detail.

First, though, I wanted to point out some differences between a GPGPU and a true vector processor. With a GPGPU, there's a host CPU, which is the x86 processor in your computer. Then you go across the bus, the PCIe bus, or PCI bus, or AGP bus, out onto a graphics card.
So there is a host processor, but there's no control processor like in the vector processors we talked about last lecture. The scalar processor is really far away and doesn't drive everything as strictly; you basically have two processors running, the host processor and the graphics processing unit. And because of this, you actually have to run all of your control code on the vector unit, somehow. This attached-host-processor model does have some advantages: you can run the data-parallel parts of the program on the GPU while the host processor, the x86 processor connected to it across the bus, runs some other program.

So, let's dive into detail here and talk about the Compute Unified Device Architecture, which is what CUDA stands for. CUDA is the Nvidia way of programming, or the Nvidia programming model, for its graphics processors. There's also a broader, industry-accepted way to program these, which uses roughly a similar programming model, called OpenCL; the name there is designed to evoke OpenGL, the graphics language that is widely used for 3D rendering.

You have to suspend disbelief here for a second, because this model is a little bit odd, I think. But let's talk about what CUDA is. Let's start with a non-CUDA program first; it's the upper portion of this code. We're computing y[i] plus x[i] times some scalar value and storing it back. This is the traditional a times one vector plus another vector, and it shows up as the inner loop of LINPACK, which is a benchmark that a lot of people run. So it's pretty similar to what we were doing before: you're adding one vector to another vector, except that you're taking one of the vectors and multiplying it by a scalar. Pretty simple.

In CUDA, the basic idea is that you don't do a lot of work per thread, as we're going to call them. What we're going to do here is define a block of data, and we use some special annotations to say what's going on. This block has some size.
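As a concrete reference point, here is a minimal sketch of that serial, non-CUDA loop; the function name and types are my own choices, not necessarily what was on the slide.

```cuda
// Serial version of the operation described above:
//   y[i] = a * x[i] + y[i]   (the classic a*X + Y, the inner loop of LINPACK)
void saxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```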
Then what we do is define a thread that can operate in parallel across those elements: there are 256 data values and 256 threads. What's going to happen is that each thread does the same operation, y[i] = a * x[i] + y[i], and stores the result back. But let's look at where i comes from. i comes from some special keywords that compute which thread number you are, or effectively which index you are within the thread block. And this if statement is the moral equivalent of our strip-mining loop: it says, if we get above however much work we're trying to do, don't do any more work. So we can block the work into some number of threads, pass in an n, where n is the amount of work to do, the length of the vector, and check against it along the way.

What I find interesting about this is that, on first view, it looks like each of these threads is completely independent. That is the programming model: each of the threads is independent. Unfortunately, on the computer architecture this is going to run on, the threads are not independent. So in CUDA, you don't want the threads to diverge. It's allowed to have them diverge: for instance, if there's an if statement in here, one thread can go into the if and a different thread can fall through. But if you do that, you're basically just not using one of the pipelines for a while. I'll come back to this.

Let's talk about how this programming model shows up. The programming model is lots of different little threads; the idea is that you create these microthreads, and then the runtime system plus the CUDA compiler put them together and have them operate on exactly the same instructions at the same time. So they're hiding a single instruction, multiple data architecture under the hood of these threads. If we look at our example here, each thread does a load, a multiply, another load, an add, and then a store; that's our a times x plus y operation going on here.
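Here is a hedged sketch of what the CUDA version and its launch look like; the kernel name, the 256-thread block size from the lecture, and the launch wrapper are illustrative assumptions rather than the slide's exact code.

```cuda
#include <cuda_runtime.h>

// Each microthread handles one element. The index i comes from the special
// thread/block keywords, and the if test is the moral equivalent of the
// strip-mining check: threads past the end of the vector do nothing.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element am I?
    if (i < n) {
        y[i] = a * x[i] + y[i];  // load, multiply, load, add, store
    }
}

// Host side: launch 256 threads per block, with enough blocks to cover n.
void run_saxpy(int n, float a, float *d_x, float *d_y) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_parallel<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}
```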
And, across the other way, all of these threads are doing the same operation, or effectively the same operation. So they call this single instruction, multiple thread, which is kind of a funny idea. In reality, it's actually single instruction, multiple data, but they introduced this notion of threads, with predication inside the single instruction, multiple data execution, to allow one microthread to do something slightly different via predication.

So, what are the implications of single instruction, multiple thread? Well, strangely enough, because it's hard to control the order of the threads relative to each other and to the data, the memory system has to support all different alignments and scatter-gather operations. They don't actually try to control the addressing, because each of the threads could potentially be doing some scatter operation. Instead, they have some really smart logic that takes all the addresses coming out of the execution units and says, oh, these look like they line up, we'll issue them at the same time. So if you happen to have threads that all access, say, a[i], where i is the thread number, then you have unit-stride operations, and the hardware goes and tries to figure that out and pack them all together. And, as I mentioned before, you have to use predication if you have different control flow; you need strong use of predication to allow threads to go in different directions.

So, things get even more complicated in these general-purpose GPU architectures: they take the word "warp" and use it where the rest of the field would say thread. Unfortunately, the GPU people came up with completely different names for everything, which is kind of annoying, because names already existed for everything; if you go read your textbook, there's a nice table that translates GPGPU nomenclature into the nomenclature the rest of computer architecture uses.

If you go look inside one of these GPUs, they actually are a massively parallel architecture with multiple lanes, like our vector architecture, and then, on top of that, they are a multithreaded architecture.
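To make the divergence and predication point concrete, here is a small sketch of my own (not from the lecture) where threads in the same warp take different branches; the SIMD hardware runs both paths with inactive threads masked off, so some lanes sit idle. Note also that indexing by the thread number is exactly the unit-stride pattern the memory system can pack into wide accesses.

```cuda
__global__ void divergent_example(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads in the same warp take different paths here. Under the
        // hood the SIMD hardware executes both paths, predicating off the
        // threads that did not take each branch, so lanes go unused.
        if (i % 2 == 0) {
            y[i] = 2.0f * x[i];
        } else {
            y[i] = x[i] + 1.0f;
        }
        // The accesses x[i] and y[i] are unit-stride across the warp, so the
        // hardware can coalesce them into wide memory operations.
    }
}
```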
Typically, these architectures don't have caches. So, to hide a lot of the memory latency, and this is part of the reason they have this threading model, they take all the threads that are active in the machine and schedule one of them; if that thread, let's say, misses out to memory, they time-slice it out and schedule a different thread. So it will fine-grain interleave threads on a functional unit. It's a strange idea, mixing multithreading with SIMD at the same time, so there are lots of different kinds of parallelism coming together in these GPGPUs. I don't want to go into too much detail, because then you'd have a whole class on how to program GPGPUs. But the basic idea I wanted to get across is that they are a multithreaded, single instruction, multiple data machine, with this strange notion of threads overlaid on top. And if the threads don't do exactly the same work, because it's a SIMD machine, you basically just end up wasting slots.

These machines do have a lot of performance. As an example, this is the modern-day Nvidia GPU architecture you can go buy; they call it Fermi. This is in that card I showed last time, the Tesla. Let's zoom in on one of these and talk about what's inside. Roughly what's going on is, well, first of all, let's look at the stuff that's not programmable, which is actually a significant portion of the design. If you look down here, they have vertex shaders and tessellation units and texture mapping units and texture caches, and really, what all of that is for is graphics processing. [LAUGH] And then, smushed onto that, is an array of general-purpose units, or mildly general-purpose units. Inside each one of these cores there's a floating-point unit and an integer unit. If you cut this way, each one of the things they call a core is effectively a lane, so they're replicated in that direction; and then one, two, three, four, five, six, seven, eight, this direction is SIMD. So they basically have a SIMD architecture with multiple lanes, and lots of parallelism going on.
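As a rough illustration of how that latency hiding shows up for the programmer, here is a sketch of my own (assumed names, not lecture code): you launch many more threads than there are execution units, so the scheduler always has other warps ready to issue while some are waiting on memory.

```cuda
// A memory-bound streaming kernel written with a grid-stride loop. Launching
// many blocks gives the scheduler plenty of resident warps to interleave on
// each functional unit while other warps wait on memory.
__global__ void scale_kernel(int n, float a, const float *in, float *out) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        out[i] = a * in[i];
    }
}

void run_scale(int n, float a, const float *d_in, float *d_out) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // oversubscribe
    scale_kernel<<<blocks, threadsPerBlock>>>(n, a, d_in, d_out);
}
```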
And then, at the top here, they have what they call the warp scheduler, which is the thread scheduler; it assigns instructions down into the different parallel units. So there are lots of interesting things going on in parallel on these machines.

I wanted to stop here on GPUs, because we could have a whole other lecture on them, and I don't really want to go into that level of detail. Let's switch topics and start talking about multithreading. Actually, before we move off this, I'm sure people have questions about GPUs. I'll take one or two of them, but I don't want to go into too much detail. I just wanted to introduce the idea that you could use graphics processors; they have some similarity to vector processors, but they're kind of the degenerate case of a vector processor.

Actually, a quick show of hands: who has an ATI video card in their machine? Okay. Who has Nvidia? Okay. Who has Intel? Aha. So, interesting tidbit here. Everyone always thinks of ATI and Nvidia as the leaders in fancy graphics cards, but in reality, Intel sells the largest number of graphics processors in the world today. That's partly because they kind of give them away for free; they've effectively integrated them onto all the Intel chips now. So it's a funny thing that the least innovative, least exciting graphics processors out there are everywhere not because they're good, but because they're cheap. Lots of things in the world work like that.