(Message Passing v. Shared Memory, MPI)

Transcript (Message Passing v. Shared Memory, MPI)

Distributed Systems
CS 15-440
Programming Models
Gregory Kesden
Borrowed and adapted from our good friends at
CMU-Doha, Qatar
Majd F. Sakr, Mohammad Hammoud andVinay Kolar
1
Objectives
Discussion on Programming Models
MapReduce
Why
parallelism?
Parallel
computer
architectures
Traditional
models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
Amdahl’s Law
 We parallelize our programs in order to run them faster
 How much faster will a parallel program run?
 Suppose that the sequential execution of a program takes T1 time units
and the parallel execution on p processors takes Tp time units
 Suppose that out of the entire execution of the program, s fraction of it is
not parallelizable while 1-s fraction is parallelizable
 Then the speedup (Amdahl’s formula):
3
Amdahl’s Law: An Example
 Suppose that 80% of you program can be parallelized and that you
use 4 processors to run your parallel version of the program
 The speedup you can get according to Amdahl is:
 Although you use 4 processors you cannot get a speedup more than
2.5 times (or 40% of the serial running time)
4
Real Vs. Actual Cases
 Amdahl’s argument is too simplified to be applied to real cases
 When we run a parallel program, there are a communication
overhead and a workload imbalance among processes in general
Serial
Parallel
20
20
80
Serial
Parallel
20
Process 1
Process 1
Process 2
Process 2
20
20
80
20
Cannot be parallelized
Process 3
Process 3
Cannot be parallelized
Process 4
Can be parallelized
Can be parallelized
Process 4
Communication overhead
Load Unbalance
1. Parallel Speed-up: An Ideal Case
2. Parallel Speed-up: An Actual Case
Guidelines
 In order to efficiently benefit from parallelization, we
ought to follow these guidelines:
1. Maximize the fraction of our program that can be parallelized
2. Balance the workload of parallel processes
3. Minimize the time spent for communication
6
Objectives
Discussion on Programming Models
MapReduce
Why
parallelism?
Parallel
computer
architectures
Traditional
models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
Parallel Computer Architectures
 We can categorize the architecture of parallel computers in terms of
two aspects:
 Whether the memory is physically centralized or distributed
 Whether or not the address space is shared
Address Space
Memory
Shared
Individual
SMP (Symmetric
– SMP (Symmetric
Multiprocessor)
Multiprocessor)
Centralized UMA
N/A
Distributed NUMA (Non-Uniform Memory Access)
MPP (Massively
Parallel Processors)
8
Symmetric Multiprocessor
 Symmetric Multiprocessor (SMP) architecture uses shared system
resources that can be accessed equally from all processors
Processor
Processor
Processor
Processor
Cache
Cache
Cache
Cache
Bus or Crossbar Switch
Memory
I/O
 A single OS controls the SMP machine and it schedules processes
and threads on processors so that the load is balanced
9
Massively Parallel Processors
 Massively Parallel Processors (MPP) architecture consists of nodes
with each having its own processor, memory and I/O subsystem
Interconnection Network
Processor
Processor
Processor
Processor
Cache
Cache
Cache
Cache
Bus
Bus
Bus
Bus
Memory
I/O
Memory
I/O
Memory
I/O
 An independent OS runs at each node
10
Memory
I/O
Non-Uniform Memory Access
 Non-Uniform Memory Access (NUMA) architecture machines are
built on a similar hardware model as MPP
 NUMA typically provides a shared address space to applications
using a hardware/software directory-based coherence protocol
 The memory latency varies according to whether you access
memory directly (local) or through the interconnect (remote). Thus
the name non-uniform memory access
 As in an SMP machine, a single OS controls the whole system
11
Objectives
Discussion on Programming Models
MapReduce
Why
parallelizing our
programs?
Parallel
computer
architectures
Traditional
Models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
Models of Parallel Programming
 What is a parallel programming model?
 A programming model is an abstraction provided by the hardware
to programmers
 It determines how easily programmers can specify their algorithms into
parallel unit of computations (i.e., tasks) that the hardware understands
 It determines how efficiently parallel tasks can be executed on the hardware
 Main Goal: utilize all the processors of the underlying architecture
(e.g., SMP, MPP, NUMA) and minimize the elapsed time of
your program
13
Traditional Parallel Programming
Models
Parallel Programming Models
Shared Memory
Message Passing
14
Shared Memory Model
 In the shared memory programming model, the abstraction is that
parallel tasks can access any location of the memory
 Parallel tasks can communicate through reading and writing
common memory locations
 This is similar to threads from a single process which share a single
address space
 Multi-threaded programs (e.g., OpenMP programs) are the best fit
with shared memory programming model
15
Shared Memory Model
Single Thread
S1
Multi-Thread
Time
Time
Si = Serial
Pj = Parallel
S1
Spawn
P1
P3
P1
P2
P3
S2
Shared Address Space
P2
Join
P3
P4
S2
Process
Process
16
Shared Memory Example
begin parallel // spawn a child thread
private int start_iter, end_iter, i;
shared int local_iter=4, sum=0;
shared double sum=0.0, a[], b[], c[];
shared lock_type mylock;
for (i=0; i<8; i++)
a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
if (a[i] > 0)
sum = sum + a[i];
Print sum;
Sequential
start_iter = getid() * local_iter;
end_iter = start_iter + local_iter;
for (i=start_iter; i<end_iter; i++)
a[i] = b[i] + c[i];
barrier;
for (i=start_iter; i<end_iter; i++)
if (a[i] > 0) {
lock(mylock);
sum = sum + a[i];
unlock(mylock);
}
barrier;
// necessary
end parallel // kill the child thread
Print sum;
Parallel
Traditional Parallel Programming
Models
Parallel Programming Models
Shared Memory
Message Passing
18
Message Passing Model
 In message passing, parallel tasks have their own local memories
 One task cannot access another task’s memory
 Hence, to communicate data they have to rely on explicit messages
sent to each other
 This is similar to the abstraction of processes which do not share an
address space
 MPI programs are
programming model
the
best
fit
with
message
19
passing
Message Passing Model
Message Passing
Single Thread
S1
Time
Time
S = Serial
P = Parallel
S1
S1
S1
S1
P1
P1
P1
P1
P1
P2
S2
S2
S2
S2
P3
P4
Process 0 Process 1 Process 2 Process 3
S2
Node 1
Node 2
Node 3
Node 4
Data transmission over the Network
Process
20
Message Passing Example
id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;
for (i=0; i<8; i++)
a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
if (a[i] > 0)
sum = sum + a[i];
Print sum;
Sequential
if (id == 0)
send_msg (P1, b[4..7], c[4..7]);
else
recv_msg (P0, b[4..7], c[4..7]);
for (i=start_iter; i<end_iter; i++)
a[i] = b[i] + c[i];
local_sum = 0;
for (i=start_iter; i<end_iter; i++)
if (a[i] > 0)
local_sum = local_sum + a[i];
if (id == 0) {
recv_msg (P1, &local_sum1);
sum = local_sum + local_sum1;
Print sum;
}
else
send_msg (P0, local_sum);
Parallel
Shared Memory Vs. Message Passing
 Comparison between shared memory and message passing
programming models:
Aspect
Shared Memory
Message Passing
Communication
Implicit (via loads/stores)
Explicit Messages
Synchronization
Explicit
Implicit (Via Messages)
Hardware Support
Typically Required
None
Development Effort
Lower
Higher
Tuning Effort
Higher
Lower
22
Objectives
Discussion on Programming Models
MapReduce
Why
parallelizing our
programs?
Parallel
computer
architectures
Traditional
Models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
SPMD and MPMD
 When we run multiple processes with message-passing, there are
further categorizations regarding how many different programs are
cooperating in parallel execution
 We distinguish between two models:
1. Single Program Multiple Data (SPMD) model
2. Multiple Programs Multiple Data (MPMP) model
24
SPMD
 In the SPMD model, there is only one program and each process
uses the same executable working on different sets of data
a.out
Node 1
Node 2
Node 3
25
MPMD
 The MPMD model uses different programs for different processes,
but the processes collaborate to solve the same problem
 MPMD has two styles, the master/worker and the coupled analysis
a.out
b.out
a.out
b.out
c.out
a.out= Structural Analysis,
b.out = fluid analysis and
c.out = thermal analysis
Example
Node 1
Node 2
Node 3
1. MPMD: Master/Slave
Node 1
Node 2
Node 3
2. MPMD: Coupled Analysis
3 Key Points
 To summarize, keep the following 3 points in mind:
 The purpose of
for computation
parallelization
is
to
reduce
the
time
spent
 Ideally, the parallel program is p times faster than the sequential
program, where p is the number of processes involved in the parallel
execution, but this is not always achievable
 Message-passing is the tool to consolidate what parallelization has
separated. It should not be regarded as the parallelization itself
27
Objectives
Discussion on Programming Models
MapReduce
Why
parallelizing our
programs?
Parallel
computer
architectures
Traditional
Models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
Message Passing Interface
 In this part, the following concepts of MPI will
be described:
 Basics
 Point-to-point communication
 Collective communication
29
What is MPI?
 The Message Passing Interface (MPI) is a message passing library
standard for writing message passing programs
 The goal of MPI is to establish a portable, efficient, and flexible
standard for message passing
 By itself, MPI is NOT a library - but rather the specification of what
such a library should be
 MPI is not an IEEE or ISO standard, but has in fact, become the
industry standard for writing message passing programs on
HPC platforms
30
Reasons for using MPI
Reason
Standardization
Portability
Performance Opportunities
Functionality
Availability
Description
MPI is the only message passing library which can be
considered a standard. It is supported on virtually all
HPC platforms
There is no need to modify your source code when you port
your application to a different platform that supports the
MPI standard
Vendor implementations should be able to exploit native
hardware features to optimize performance
Over 115 routines are defined
A variety of implementations are available, both vendor and
public domain
31
Programming Model
 MPI is an example of a message passing programming model
 MPI is now used on just about any common parallel architecture
including MPP, SMP clusters, workstation clusters and
heterogeneous networks
 With MPI the programmer is responsible for correctly identifying parallelism
and implementing parallel algorithms using MPI constructs
32
Communicators and Groups
 MPI uses objects called communicators and groups to define which
collection of processes may communicate with each other to solve a
certain problem
 Most MPI routines require you to specify a communicator
as an argument
 The communicator MPI_COMM_WORLD is often used in calling
communication subroutines
 MPI_COMM_WORLD is the predefined communicator that includes
all of your MPI processes
33
Ranks
 Within a communicator, every process has its own unique, integer
identifier referred to as rank, assigned by the system when the
process initializes
 A rank is sometimes called a task ID. Ranks are contiguous and
begin at zero
 Ranks are used by the programmer to specify the source and
destination of messages
 Ranks are often also used conditionally by the application to control
program execution (e.g., if rank=0 do this / if rank=1 do that)
34
Multiple Communicators
 It is possible that a problem consists of several sub-problems where
each can be solved concurrently
 This type of application is typically found in the category of MPMD
coupled analysis
 We can create a new communicator for each sub-problem as a
subset of an existing communicator
 MPI allows you to achieve that by using MPI_COMM_SPLIT
35
Example of Multiple
Communicators
 Consider a problem with a fluid dynamics part and a structural
analysis part, where each part can be computed in parallel
MPI_COMM_WORLD
Comm_Fluid
Comm_Struct
Rank=0
Rank=1
Rank=0
Rank=1
Rank=0
Rank=1
Rank=4
Rank=5
Rank=2
Rank=3
Rank=2
Rank=3
Rank=2
Rank=3
Rank=6
Rank=7
 Ranks within MPI_COMM_WORLD are printed in red
 Ranks within Comm_Fluid are printed with green
 Ranks within Comm_Struct are printed with blue
Next Class
Discussion on Programming Models
MapReduce
Why
parallelizing our
programs?
Parallel
computer
architectures
Traditional
Models of
parallel
programming
Examples of
parallel
processing
Message
Passing
Interface
(MPI)
Programming Models- Part II
Message Passing Interface
 In this part, the following concepts of MPI will
be described:
 Basics
 Point-to-point communication
 Collective communication
38
Point-to-Point Communication
 MPI point-to-point operations typically involve message passing
between two, and only two, different MPI tasks
 One task performs a send operation
and the other performs a matching
receive operation
Processor1
Network
Processor2
sendA
recvA
 Ideally, every send operation would be perfectly synchronized with
its matching receive
 This is rarely the case. Somehow or other, the MPI implementation
must be able to deal with storing data when the two tasks are
out of sync
39
Two Cases
 Consider the following two cases:
1. A send operation occurs 5 seconds before the
receive is ready - where is the message stored while
the receive is pending?
2. Multiple sends arrive at the same receiving task
which can only accept one send at a time - what
happens to the messages that are "backing up"?
40
Steps Involved in Point-to-Point
Communication
1. The data is copied
to the user buffer
by the user
2. The user calls one
of the MPI send
routines
3. The system copies
the data from the
user buffer to the
system buffer
4. The system sends
the data from the
system buffer to
the
destination
process
Process 0
Sender
User Mode
1. The user calls one
of the MPI receive
routines
Kernel Mode
sendbuf
1
sysbuf
2
Call a send routine
Now sendbuf can be
reused
3
Copying data from
sendbuf to sysbuf
Send data from
sysbuf to destination
Data
Process 1
Receiver
User Mode
Call a recev routine
1
Kernel Mode
Receive data from
source to sysbuf
4
Now recvbuf contains
valid data
recvbuf
2
4
2. The
system
receives the data
from the source
process
and
copies it to the
system buffer
3. The system copies
data
from
the
system buffer to
the user buffer
sysbuf
3
Copying data from
sysbuf to recvbuf
4. The user uses
data in the user
buffer
Blocking Send and Receive
 When we use point-to-point communication routines, we usually
distinguish between blocking and non-blocking communication
 A blocking send routine will only return after it is safe to modify the
application buffer for reuse
 Safe means that modifications will not affect
the data intended for the receive task
Safe to modify
sendbuf
Rank 0
Rank 1
sendbuf
recvbuf
Network
 This does not imply that the data was
actually received by the receiver- it may be
sitting in the system buffer at the sender side
recvbuf
42
sendbuf
Blocking Send and Receive
 A blocking send can be:
1. Synchronous: Means there is a handshaking occurring with the
receive task to confirm a safe send
2. Asynchronous: Means the system buffer at the sender side is
used to hold the data for eventual delivery to the receiver
 A blocking receive only returns after the data has arrived (i.e.,
stored at the application recvbuf) and is ready for use by
the program
43
Non-Blocking Send and Receive (1)
 Non-blocking send and non-blocking receive routines
behave similarly
 They return almost immediately
 They do not wait for any communication events to complete
such as:
 Message copying from user buffer to system buffer
 Or the actual arrival of a message
44
Non-Blocking Send and Receive (2)
 However, it is unsafe to modify the application buffer until you make
sure that the requested non-blocking operation was actually
performed by the library
 If you use the application buffer before the copy completes:
 Incorrect
data
may
be
copied
(in case of non-blocking send)
 Or your receive buffer does not
(in case of non-blocking receive)
to
the
contain
system
what
 You can make sure of the completion of the
using MPI_WAIT() after the send or receive operations
45
buffer
you
copy
want
by
Why Non-Blocking Communication?
 Why do we use non-blocking communication despite
its complexity?
 Non-blocking communication is generally faster than its
corresponding blocking communication
 We can overlap computations while the system is copying data
back and forth between application and system buffers
46
MPI Point-To-Point Communication
Routines
Routine
Signature
Blocking send
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int
dest, int tag, MPI_Comm comm )
Non-blocking send
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int
dest, int tag, MPI_Comm comm, MPI_Request *request )
Blocking receive
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Status *status )
Non-blocking receive
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Request *request )
47
Message Order
 MPI guarantees that messages will not overtake each other
 If a sender sends two messages M1 and M2 in succession to
the same destination, and both match the same receive, the
receive operation will receive M1 before M2
 If a receiver posts two receives R1 and R2, in succession,
and both are looking for the same message, R1 will receive
the message before R2
48
Fairness
 MPI does not guarantee fairness – it is up to the programmer to
prevent operation starvation
 For instance, if task 0 and task 1 send competing messages (i.e.,
messages that match the same receive) to task 2, only one of the
sends will complete
Task 0
Task 1
Msg A
Msg A
?
Task 2
49
Unidirectional Communication
 When you send a message from process 0 to process 1, there are
four combinations of MPI subroutines to choose from
Rank 0
Rank 1
sendbuf
recvbuf
recvbuf
sendbuf
1. Blocking send and blocking receive
2. Non-blocking send and blocking receive
3. Blocking send and non-blocking receive
4. Non-blocking send and non-blocking receive
50
Bidirectional Communication
 When two processes exchange data with each other, there are
essentially 3 cases to consider:
Rank 0
Rank 1
 Case 1: Both processes call the send
routine first, and then receive
sendbuf
recvbuf
 Case 2: Both processes call the receive
routine first, and then send
recvbuf
sendbuf
 Case 3: One process calls send and receive routines in this order, and
the other calls them in the opposite order
51
Bidirectional CommunicationDeadlocks
 With bidirectional
about deadlocks
communication,
we
 When a deadlock occurs, processes
involved in the deadlock will not proceed
any further
 Deadlocks can take place:
have
to
be
careful
Rank 0
Rank 1
sendbuf
recvbuf
recvbuf
sendbuf
1. Either due to the incorrect order of send and receive
2. Or due to the limited size of the system buffer
52
Case 1. Send First and Then Receive
 Consider the following two snippets of pseudo-code:
IF (myrank==0) THEN
CALL MPI_SEND(sendbuf,
CALL MPI_RECV(recvbuf,
ELSEIF (myrank==1) THEN
CALL MPI_SEND(sendbuf,
CALL MPI_RECV(recvbuf,
ENDIF
…)
…)
…)
…)
IF (myrank==0) THEN
CALL MPI_ISEND(sendbuf, …, ireq, …)
CALL MPI_WAIT(ireq, …)
CALL MPI_RECV(recvbuf, …)
ELSEIF (myrank==1) THEN
CALL MPI_ISEND(sendbuf, …, ireq, …)
CALL MPI_WAIT(ireq, …)
CALL MPI_RECV(recvbuf, …)
ENDIF
 MPI_ISEND immediately followed by MPI_WAIT is logically
equivalent to MPI_SEND
53
Case 1. Send First and Then Receive
 What happens if the system buffer is larger than the send buffer?
 What happens if the system buffer is not larger than the
send buffer?
DEADLOCK!
Rank 0
Rank 1
sendbuf
Rank 0
sendbuf
sendbuf
Network
sysbuf
recvbuf
Rank 1
sendbuf
Network
sysbuf
recvbuf
sysbuf
sysbuf
recvbuf
recvbuf
Case 1. Send First and Then Receive
 Consider the following pseudo-code:
IF (myrank==0) THEN
CALL MPI_ISEND(sendbuf, …, ireq, …)
CALL MPI_RECV(recvbuf, …)
CALL MPI_WAIT(ireq, …)
ELSEIF (myrank==1) THEN
CALL MPI_ISEND(sendbuf, …, ireq, …)
CALL MPI_RECV(recvbuf, …)
CALL MPI_WAIT(ireq, …)
ENDIF
 The code is free from deadlock because:
 The program immediately returns from MPI_ISEND and starts receiving
data from the other process
 In the meantime, data transmission is completed and the calls of MPI_WAIT
for the completion of send at both processes do not lead to a deadlock
Case 2. Receive First and Then Send
 Would the following pseudo-code lead to a deadlock?
 A deadlock will occur regardless of how much system buffer we have
IF (myrank==0) THEN
CALL MPI_RECV(recvbuf, …)
CALL MPI_SEND(sendbuf, …)
ELSEIF (myrank==1) THEN
CALL MPI_RECV(recvbuf, …)
CALL MPI_ISEND(sendbuf, …)
ENDIF
 What if we use MPI_ISEND instead of MPI_SEND?
 Deadlock still occurs
Case 2. Receive First and Then Send
 What about the following pseudo-code?
 It can be safely executed
IF (myrank==0) THEN
CALL MPI_IRECV(recvbuf, …, ireq, …)
CALL MPI_SEND(sendbuf, …)
CALL MPI_WAIT(ireq, …)
ELSEIF (myrank==1) THEN
CALL MPI_IRECV(recvbuf, …, ireq, …)
CALL MPI_SEND(sendbuf, …)
CALL MPI_WAIT(ireq, …)
ENDIF
Case 3. One Process Sends and
Receives; the other Receives and Sends
 What about the following code?
IF (myrank==0) THEN
CALL MPI_SEND(sendbuf,
CALL MPI_RECV(recvbuf,
ELSEIF (myrank==1) THEN
CALL MPI_RECV(recvbuf,
CALL MPI_SEND(sendbuf,
ENDIF
…)
…)
…)
…)
 It is always safe to order the calls of MPI_(I)SEND and
MPI_(I)RECV at the two processes in an opposite order
 In this case, we can use either blocking or non-blocking subroutines
A Recommendation
 Considering the previous options, performance, and the avoidance
of deadlocks, it is recommended to use the following code:
IF (myrank==0) THEN
CALL MPI_ISEND(sendbuf,
CALL MPI_IRECV(recvbuf,
ELSEIF (myrank==1) THEN
CALL MPI_ISEND(sendbuf,
CALL MPI_IRECV(recvbuf,
ENDIF
CALL MPI_WAIT(ireq1, …)
CALL MPI_WAIT(ireq2, …)
…, ireq1, …)
…, ireq2, …)
…, ireq1, …)
…, ireq2, …)
Message Passing Interface
 In this part, the following concepts of MPI will
be described:
 Basics
 Point-to-point communication
 Collective communication
60
Collective Communication
 Collective communication allows you to exchange data among a
group of processes
 It must involve all processes in the scope of a communicator
 The communicator argument in a collective communication routine
should specify which processes are involved in the communication
 Hence, it is the programmer's responsibility to ensure that all
processes
within
a
communicator
participate
in
any
collective operation
61
Patterns of Collective
Communication
 There are several patterns of collective communication:
1.
2.
3.
4.
5.
6.
7.
8.
9.
Broadcast
Scatter
Gather
Allgather
Alltoall
Reduce
Allreduce
Scan
Reducescatter
62
1. Broadcast
 Broadcast sends a message from the process with rank root to all
other processes in the group
P0
Data
Process
Process
Data
P0
A
P1
A
P2
P2
A
P3
P3
A
P1
A
Broadcast
int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
63
2-3. Scatter and Gather
 Scatter distributes distinct messages from a single source task to
each task in the group
 Gather gathers distinct messages from each task in the group to a
single destination task
P0
A B C D
Data
Scatter
P1
P2
P3
Gather
Process
Process
Data
P0
A
P1
B
P2
C
P3
D
int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt,
MPI_Datatype recvtype, int root, MPI_Comm comm )
int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount,
MPI_Datatype recvtype, int root, MPI_Comm comm )
4. All Gather
 Allgather gathers data from all tasks and distribute them to all tasks.
Each task in the group, in effect, performs a one-to-all broadcasting
operation within the group
P0
A
P1
B
P2
P3
Data
Process
Process
Data
P0
A B C D
P1
A B C D
C
P2
A B C D
D
P3
A B C D
allgather
int MPI_Allgather ( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int
recvcount, MPI_Datatype recvtype, MPI_Comm comm )
5. All To All
 With Alltoall, each task in a group performs a scatter operation,
sending a distinct message to all the tasks in the group in order
by index
Data
P0
A0
A1
A2 A3
P1
B0
B1
B2 B3
P2
C0
C1
P3
D0
D1
Process
Process
Data
P0
A0
B0
C0 D0
P1
A1
B1
C1 D1
C2 C3
P2
A2
B2
C2 D2
D2 D3
P3
A3
B3
C3 D3
Alltoall
int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt,
MPI_Datatype recvtype, MPI_Comm comm )
6-7. Reduce and All Reduce

Reduce applies a reduction operation on all tasks in the group and places
the result in one task

Allreduce applies a reduction operation and places the result in all tasks in
the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast
P1
B
P2
C
P3
D
Reduce
P0
A
P1
B
P2
P2
P3
P3
P0
P1
A*B*C*D
Data
Process
A
Process
P0
Data
Data
Process
Process
Data
P0
A*B*C*D
P1
A*B*C*D
C
P2
A*B*C*D
D
P3
A*B*C*D
Allreduce
int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int
root, MPI_Comm comm )
int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm )
8. Scan

Scan computes the scan (partial reductions) of data on a collection
of processes
Data
Process
Process
Data
P0
A
P1
A*B
C
P2
A*B*C
D
P3
A*B*C*D
P0
A
P1
B
P2
P3
Scan
int MPI_Scan ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm )
9. Reduce Scatter

Reduce Scatter combines values and scatters the results. It is equivalent to
an MPI_Reduce followed by an MPI_Scatter operation.
P0
Data
Process
Process
Data
P0
A0*B0*C0*D0
P1
A1*B1*C1*D1
C2 C3
P2
A2*B2*C2*D2
D2 D3
P3
A3*B3*C3*D3
A0
A1
A2 A3
P1
B0
B1
B2 B3
P2
C0
C1
P3
D0
D1
Reduce
Scatter
int MPI_Reduce_scatter ( void *sendbuf, void *recvbuf, int *recvcnts, MPI_Datatype datatype, MPI_Op
op, MPI_Comm comm )
Considerations and Restrictions
 Collective operations are blocking
 Collective communication routines do not take message tag
arguments
 Collective operations within subsets of processes are accomplished
by first partitioning the subsets into new groups and then attaching
the new groups to new communicators
70