But before we do that, I wanted to talk about programming, and about a competitor to using shared memory for programming.

So far, we have looked at a shared memory architecture to communicate from one core to another core. What that means is that one core does a store to some memory address, and sometime in the future another core goes and reads that memory address and gets the data.

What's interesting about that paradigm is that the sender of the data, the core which is writing the data or doing the store, does not need to know which core is going to read the data in the future. It's even possible that no one will ever go read that data, and in shared memory architectures that doesn't affect anything. One core writes a piece of data to memory at some address, and given this implicit name we effectively have an associative match: you can look up the piece of data based on the address and go find it sometime in the future. Now, you probably need to use locking to guarantee causality between one processor writing the data and another processor reading the data in shared memory, but you don't need to know the destination.

In contrast, we're going to talk about explicit message passing as a programming model. There are lots of hardware implementations you can use to build this, but let's start by talking about explicit message passing as a programming model for communicating between different threads, processes, or cores.

What we're going to have here is an API, a programming interface, of send, which names the destination and passes in a pointer to some data. Our messaging API is going to take this data and somehow get it to the receiver, the destination. On the receive side, the most basic thing you can do is receive data. Note that this receive does not denote who it's receiving data from. So a common thing to do is just receive in order, that is, receive the next piece of data that comes in. Now, that may not be super useful, so you may want to extend this so that the receive also takes a source. But it's not required.
The reason I'm talking about this in an abstract fashion is that there are many different explicit message passing programming interfaces, because it's just software that people have implemented over the years. But at the fundamental level, there's a send and a receive. The send requires a destination. The receive can just receive from the input queue, or it can try to receive from a named source. Another way to do it is what they do in MPI, the Message Passing Interface, which is probably the most common programming interface for messaging: the receive won't take a source, but instead it takes a tag, which is effectively a number that names the traffic flow. A tag is sort of a more generalized form of what TCP/IP and UDP call ports.

Okay. So, questions so far about this basic programming model and why it's different from loads and stores through shared addresses? They are very different; they are duals of each other. I'll tell you right now that if you can implement shared memory, you can implement messaging, and vice versa. There have been lots of fights over this, and at some point the whole community realized that you can do either with either: any programming model you can do in one, you can do in the other.

Okay. So, the default message type we're going to talk about is unicast. That's one-to-one: if you send to a destination, the destination is one other node, or one other process, or one other thread. The reason I say process, thread, and node is that it depends on how the API is structured. You could use an explicit message passing system on something like a uniprocessor, where you have different threads or different processes running and you're only messaging from one process to another process, even though it's one core. But there's a pretty natural extension of that: you can take those two processes, put them on different cores, and have them communicate using the same interface. We could also do this between two threads, but that's less common. Okay.
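To make that abstract interface concrete, here is a minimal sketch of what such an API might look like. The names and signatures are hypothetical, not from any particular library; they are only meant to show what each call has to name.

```c
#include <stddef.h>

/* A minimal sketch of the abstract send/receive interface described
 * above. These names and signatures are hypothetical. */

typedef int node_id;   /* a destination: a core, process, or thread */

/* The sender must name the destination; the data travels with the message. */
void msg_send(node_id dest, const void *data, size_t len);

/* Simplest receive: take the next message that arrives, from anyone. */
void msg_recv(void *buf, size_t len);

/* Optional refinements: receive only from a given source, or only
 * messages carrying a given tag (the MPI-style flow name). */
void msg_recv_from(node_id src, void *buf, size_t len);
void msg_recv_tag(int tag, void *buf, size_t len);
```

The key asymmetry is that send always names a destination, while receive can be anonymous, filtered by source, or filtered by tag.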
So, unicast is one-to-one. Some messaging networks and some message passing programming models will have a couple of other choices here. We can have multicast. This is still communication from one node, but instead of the destination denoting one other location, it can denote a set of other nodes to communicate with. It's still messaging, except we've expanded our notion of destination so that it can be a set, not just a single thing. We can also have broadcast, where instead of a destination you have some magic flag which says: communicate this with every other node in the system, or every other process in the system, or every other thread in the system, or every other process in that process group or thread group. There are ways to say that in these software protocols.

Just a quick question here: has anyone here programmed in MPI? Okay, good. So, we'll talk briefly. I'm going to show a brief code example of one process communicating with another process via a very common messaging library called the Message Passing Interface, or MPI for short.

So, let's look at this piece of code. It's C code, and what's happening is that we start our program and we have myid. The programming model of MPI is that it will actually start the exact same program on multiple cores or as multiple processes; it's not threads, but multiple processes on multiple cores will all start up the exact same program. This is sometimes called SPMD, the single program, multiple data programming model. So in the SPMD model here, what you're going to notice is that we're actually going to execute this program twice, once on one core and once on the second core. Now you might ask: well, if they're executing the exact same program, are they going to do the exact same thing? Let's take a look.

Okay. So, we have this thing called myid. This is what tells us which core we are. And there's something here called numprocs, which is going to tell us how many processors there are in the computer. And we fill these in.
So, you see here, we start off by calling MPI_Init. MPI_Init sets up the message passing system from a software perspective. Then we call MPI_Comm_size with some magic parameters, and it fills in this field, which tells us how many processes there are in this MPI program. When you launch an MPI program, there's a special MPI launch command you have to run, which takes the program you're trying to run and the number of processes you're trying to run it on. So, our program can dynamically detect how many processes it's running on. It can also detect what's called our rank, which is our ID. So now we can detect whether we were the first process or the second process launched in a two-process system, which one we are, and we fill in myid with our rank. This is a library that figures it out for us, in this MPI implementation.

Okay. So, now we assert that numprocs equals two. This is just to make sure we're not running on three processors or ten processors or something like that. We want strictly two, not one and not three, so the assert will fail if there's a problem there.

Okay. So now, this is where we can do different things. Single program, multiple data: the reason it's called this is that we can actually make decisions based on our processor ID, myid here. So we'll say: is myid zero? If so, do this; else, do this other piece of code. Okay? So, does anyone want to hazard a guess at what this program does so far? One processor is going to be executing this branch, and the other processor is going to be executing that one.

The first processor executes a send. Let's look at this first line and see what happens. This is the data we're trying to send: we're sending x, which is an integer, which we load with the number 475, our course number. And we send that with a particular tag. MPI is structured around this notion of tags, which is how you can effectively connect up a sender and a receiver. Or, sorry, that's how you can connect up multiple pieces of traffic between senders and receivers.
The other way you match things up is by a number saying who you're sending to and receiving from; this is the core number, the rank. So, what this line says is: send x to rank one with a tag, and we're going to say the tag is 475 also. Don't worry about the rest of the MPI arguments here. And this argument is a length, so this says: send one word of size integer to processor one with a particular tag. And MPI_COMM_WORLD is basically an extra argument which says whether you want to involve all of the MPI processes, or you can make subgroups; MPI has this notion of groups, but that's more complicated.

Okay. So, at the same time, this other program is executing over here. The first thing it does is a receive. If it gets here first, it's just going to block and wait in this receive. Conveniently, the first processor did a send, and there's a matched receive here. So, it's going to receive, with tag 475, length one, of an MPI integer, from rank zero, and return a status code. It fills in y with the data that got sent in this message. That's pretty fun: we just communicated information between two processes.

Okay. So, let's walk through what's happening here. Process zero does the send, so it's going to send 475 to processor one. Processor one could have gotten here early, to this receive. Processor one does not execute any of the code in the other branch, and it has a different address space; its x and y are different variables, and we have no shared memory here. It gets here and does the receive, and at some point the message shows up at the receiver, gets received, and is put into y. The receive just fills it in, which is why we pass the address of y. Now we increment y by 105, so we've done some computation on this node. At the same time, while we're doing this receive and this increment and this send, process zero is basically just sitting there waiting, trying to receive. So, we have process zero send some data to process one; process one does some math on it, incrementing the number, and then it's going to send that number back.
So, it does the send back here, with y again. We're going to send y to process zero, of length one, and when the message shows up, we receive it in process zero, and process zero prints out a message. Okay, what does it say? Received number 580. Yes, and you should go take 580A, not 580 without an A, because 580 without an A is not the parallel programming class; that's the security class next to it. But 580A, you should go to it. So, it prints out the number. So, we can communicate from one process to the other process and back with messaging.

One of the things we should note here is that we are both moving data and synchronizing at the same time in this message. It's doing two different operations at once.
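Putting the whole walkthrough together, here is a minimal sketch of the ping-pong program described above, written against the standard MPI C API. The variable names and the choice of tag follow the description in the lecture, but treat this as an illustration rather than the exact code on the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int myid, numprocs;
    int x, y;
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* set up the message passing system */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* how many processes are running    */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);     /* which one am I (my rank)?         */

    assert(numprocs == 2);                    /* this example wants exactly two    */

    if (myid == 0) {
        x = 475;                              /* the course number                 */
        /* send one int to rank 1 with tag 475 */
        MPI_Send(&x, 1, MPI_INT, 1, 475, MPI_COMM_WORLD);
        /* block until the incremented value comes back */
        MPI_Recv(&y, 1, MPI_INT, 1, 475, MPI_COMM_WORLD, &status);
        printf("Received number %d\n", y);    /* prints 580 */
    } else {
        /* rank 1: receive, do some math, send the result back */
        MPI_Recv(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD, &status);
        y += 105;
        MPI_Send(&y, 1, MPI_INT, 0, 475, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Both ranks run the same binary and branch on their rank, which is exactly the SPMD model; you would launch it with something like mpirun -np 2 ./pingpong.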
Now, that's a good question: how does this actually get implemented? This is just a programming model right now, and there are many different ways to implement MPI, and many ways people have implemented the interface of MPI send and MPI receive. People have implemented it over a hardware message passing network, which does not have to go into the OS: you send, and there's effectively hardware there which implements MPI send and receive, sends it out over the network, and receives it on the receive side, with some interconnection network in the middle. That's one way. Another way is that you run MPI over a shared memory machine, at which point MPI send (this is going to sound kind of funny) basically takes the data and copies it into RAM somewhere, and when the receiver goes to receive it, it looks it up in a hash table, finds the location in RAM, and reads from that pointer. That's an option, and it is actually one of the most common ways that people run MPI today; it doesn't have to go into the OS because it's just shared memory. On small machines, people run MPI that way, and it has typically had better performance than the equivalent shared memory program. The reason is that you've made the communication explicit. The coherence protocol can basically be optimized for producer-consumer: you have a write to some location and then someone else reads it, and you're not going to have random false sharing problems or lots of other problems. So, that's pretty common.

Where MPI has its biggest use today is in massively parallel computations, the supercomputers of the world: all the supercomputers minus the vector supercomputers, which typically don't use MPI. The biggest computers in the world right now, the Roadrunner computer at Los Alamos, for instance, and similar machines, will have special network cards that effectively implement MPI, and they do it in user space. Another way to do it is to implement MPI over TCP/IP going through the operating system, and that's pretty common when people run small clusters of computers: MPI send and receive go into the library first, the library reads and writes sockets, which trap into the operating system, and at some point that goes out over the network card. All of those are possible, and all of them have very optimized MPI implementations, because this is the most widely used parallel messaging library for high-performance applications. People have optimized it quite a bit.

Okay. So, this brings us to the question (there are a lot of words on this slide, but I do want to talk about it) of message passing versus shared memory. We have two different parallel programming models: message passing, where you have to explicitly name your destinations, and shared memory, where you write to some location and at some point in the future someone goes and reads from it, and you don't know who the reader is. It could be any of the processors in the system. It could even be literally random: you read some random number and it tells you who is going to read that location.
But if you think about it, because anyone can write to a location and anyone else can go read it, shared memory is effectively one-to-all communication, whereas message passing allows your microarchitecture and your system to optimize around point-to-point communication.

Okay. So, let's compare these two things. In message passing, memory is typically private, per node or per process or per core; memory is not shared. In shared memory, by definition, memory is shared. Message passing has explicit sends and receives in our software code: we actually write an MPI send and an MPI receive. Conversely, in shared memory we have implicit communication via loads and stores, and a load or a store names an address, but it does not name a core number or a process number or a thread number. Message passing moves data: we do a send and we put some data in it. And, by definition of receiving a message on the receive side, there is some synchronization there; there's a producer-consumer relationship from the sender to the receiver. In shared memory, you need to add explicit synchronization via fences, locks, and flags. You need to add something in there to do the synchronization, and if you don't, you can have race conditions: you can pick up incorrect data, which is a pretty common programming error in shared memory programming. And in shared memory you don't need to know the destination, whereas in message passing you do.

From a programming perspective, what I wanted to point out is that message passing is very natural for producer-consumer style computations. If you have one node producing some value and another node reading that value, it's very natural: you set up a send from one node to the other, receive it at the other node, and you have a channel between them that you can communicate over. You can send data down it, it's all in order, we'll say, and stuff comes out the other side. Very natural for producer-consumer.
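As a sketch of that channel idea, here is what a producer-consumer stream might look like in MPI, assuming the same two-rank setup as the earlier example. The functions produce_item and consume_item are hypothetical placeholders for whatever work each side does, and the tag and item count are arbitrary.

```c
#include <mpi.h>

int  produce_item(int i);     /* hypothetical: make the i-th value       */
void consume_item(int item);  /* hypothetical: do something with a value */

#define NITEMS     100
#define STREAM_TAG 7

/* Rank 0 produces values in order; rank 1 consumes them in order.
 * Assumes MPI_Init has already been called and there are exactly two ranks. */
void stream(int myid) {
    if (myid == 0) {                          /* producer side of the channel */
        for (int i = 0; i < NITEMS; i++) {
            int item = produce_item(i);
            MPI_Send(&item, 1, MPI_INT, 1, STREAM_TAG, MPI_COMM_WORLD);
        }
    } else {                                  /* consumer side of the channel */
        for (int i = 0; i < NITEMS; i++) {
            int item;
            MPI_Recv(&item, 1, MPI_INT, 0, STREAM_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            consume_item(item);
        }
    }
}
```

Because the channel delivers messages in order, the consumer sees the items in exactly the order they were produced, with no extra synchronization in the program.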
With shared memory, you have to implement producer-consumer the way we did last lecture, with a FIFO in memory and locks on that structure. But what is easy in shared memory, and hard to do in message passing, is a large shared data structure. Take, for example, a big table that you're trying to process in parallel. Say you have lots of little files, and you give each of these files to a different processor. Each processor reads its file, and what we're trying to do is build a histogram: it reads some number out of the file and, based on that number, it goes to a shared data structure and increments that location by one. Now, because it's shared memory, you want to make sure that two processors or two threads aren't trying to increment the same number at the same time. So you probably want to lock that location in the table, or lock the whole table, increment it, and then unlock it, so that you don't have two people incrementing it at the same time. But it's very natural to build a shared table in shared memory, have locks on that shared table, and have different processors operating on that shared piece of data at the same time.
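Here is a minimal sketch of that shared-table histogram using POSIX threads, with one mutex protecting the whole table. The helper read_next_value is hypothetical, standing in for reading numbers out of a file, and a lock per bucket would work just as well as the single table lock.

```c
#include <pthread.h>
#include <stddef.h>

#define BINS 256

/* Hypothetical helper: reads the next number from a file, returns 0 at EOF. */
int read_next_value(const char *filename, int *out);

/* One shared histogram table, one lock, shared by all threads. */
static int histogram[BINS];
static pthread_mutex_t histo_lock = PTHREAD_MUTEX_INITIALIZER;

static void histogram_add(int value) {
    pthread_mutex_lock(&histo_lock);    /* lock the table (or just this bin) */
    histogram[value % BINS]++;          /* increment that location by one    */
    pthread_mutex_unlock(&histo_lock);  /* unlock so others can update       */
}

/* Each worker thread processes its own file and updates the shared table. */
void *histogram_worker(void *arg) {
    const char *filename = (const char *)arg;
    int v;
    while (read_next_value(filename, &v))
        histogram_add(v);
    return NULL;
}
```

Without the lock, two threads incrementing the same bin at the same time would race and lose updates, which is exactly the kind of shared memory bug mentioned above.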
So, the interesting thing here is that you can actually tunnel shared memory over messaging, and you can tunnel messaging over shared memory.

Let's look at the first example: how to implement shared memory on top of messaging. You could do this in software. In software, we effectively try to turn all of our loads and stores into sends and receives. Maybe there's one centralized node which holds the big notion of memory, that's one way to do it, or you distribute it somehow. It's going to be pretty painful, but people have implemented systems like this, where your compiler goes and picks out loads and stores and turns them into messages. A more common thing is to have hardware that automatically turns the communication into messages. So, we take this notion of a programming model and instead turn it into a broader notion of entities communicating via explicit messages sent over some sort of interconnect. This is actually the most common way these days that people implement large shared memory machines: the memory traffic gets packetized (we'll talk about packetization in a second), gets sent over a network, gets received, and then some receiver does something with it.

As an example, let's say this core here wants to do a load, and the data is in main memory. On a bus, the core could just shout, "I need address five," and the memory would shout back, "Got it, address five has the value six." If we have a switched interconnect, we put switches in there, and the core actually packetizes the memory request: it takes the load, makes a message out of it, and sends it over the network, and it shows up at the memory. The memory reads the message, makes a response, and sends it back over the network. So effectively, we can tunnel shared memory over a message network. For large cache coherence systems, this is actually the most common thing that happens. We'll be talking about this in greater depth in two classes, when we get to directory-based protocols, which effectively implement shared memory over these switched interconnection networks.

You can also implement messaging over shared memory, and this is pretty common in small systems that have shared memory and no other way to communicate. This is exactly what we talked about last lecture: we had a FIFO with head and tail pointers, the producer can enqueue onto that queue in main memory, the consumer can go read from it, and you can implement messaging that way (there's a small sketch of such a queue below).

So, I guess what I'm trying to get at here is that messaging and shared memory are duals of each other. One may be more natural for a given problem than the other, but you can implement any algorithm in either, and you can implement each one on top of the other. And that's been shown at this point.
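To make that last point concrete, here is a minimal sketch of a message channel tunneled over shared memory: a single-producer, single-consumer FIFO with head and tail pointers, like the one from last lecture. This version uses C11 atomics for the acquire/release ordering that plays the role of the fences discussed earlier; the fixed queue size and the int payload are just for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 64   /* capacity of the ring */

/* A "message channel" built out of shared memory: one producer enqueues,
 * one consumer dequeues, and head/tail index into the ring buffer. */
typedef struct {
    int buf[QSIZE];
    _Atomic unsigned head;   /* total items enqueued (producer advances) */
    _Atomic unsigned tail;   /* total items dequeued (consumer advances) */
} channel_t;

/* "send": returns false if the queue is currently full. */
bool ch_send(channel_t *ch, int data) {
    unsigned h = atomic_load_explicit(&ch->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                             /* full */
    ch->buf[h % QSIZE] = data;                    /* write the payload first   */
    atomic_store_explicit(&ch->head, h + 1,
                          memory_order_release);  /* then publish it           */
    return true;
}

/* "receive": returns false if there is nothing to read yet. */
bool ch_recv(channel_t *ch, int *data) {
    unsigned t = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (t == h)
        return false;                             /* empty */
    *data = ch->buf[t % QSIZE];                   /* read the payload          */
    atomic_store_explicit(&ch->tail, t + 1,
                          memory_order_release);  /* free the slot             */
    return true;
}
```

The release on head and the matching acquire on head are what guarantee the consumer sees the payload before it sees the updated pointer, which is the causality ordering that the locking discussion earlier was about.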